Abstract

The aim of this paper is to study different estimation procedures based on $\varphi$-divergences. The dual
representation of $\varphi$-divergences based on the Fenchel-Legendre duality is the main interest of this study. It
provides a way to estimate $\varphi$-divergences by a simple plug-in of the empirical distribution without any
smoothing technique. The resulting estimators are studied theoretically and through simulations, which show that the
so-called minimum dual $\varphi$-divergence estimator (MD$\varphi$DE) is generally not robust and behaves similarly to
the maximum likelihood estimator. We give some arguments supporting this lack of robustness, and give insights on how
to modify the classical approach. An alternative class of robust $\varphi$-divergence estimators based on the dual
representation is presented. We study the consistency and the robustness of the new estimators from an influence
function point of view. In a second part, we invoke the Basu-Lindsay approach for approximating $\varphi$-divergences
and provide a comparison between these approaches. The so-called dual $\varphi$-divergence is also discussed and
compared to our new estimator. A full simulation study of all these approaches is given in order to compare the
efficiency and robustness of all the mentioned estimators against the so-called minimum density power divergence,
showing encouraging results in favor of our new class of minimum dual $\varphi$-divergences.


Introduction 

The maximum likelihood method is a simple and efficient method to estimate the unknown parameters of a given
model. Its most common drawback is its sensitivity to contamination and misspecification. Since the first years of
the twentieth century, many researchers such as Pearson, Hellinger, Kullback and Leibler, Neyman and others have used
distance-like functions between probability density functions, called divergences. The resulting estimators have
shown good robustness against outliers. Nowadays, several divergence-based techniques perform well in the presence of
noise, such as $\varphi$-divergences (Csiszar [1963], Ali and Silvey [1966]), $S$-divergences (Ghosh et al. [2013]),
Rényi pseudodistances (see for example Toma and Leoni-Aubin [2013]), Bregman divergences and many others. In this
paper we are particularly interested in $\varphi$-divergences and in comparing them with maximum likelihood
(calculated using the EM algorithm for mixtures) and with some particular cases of $S$-divergences and Bregman
divergences.

We define a $\varphi$-divergence in the sense of Csiszar [1963] as follows. Let $\varphi : [0, \infty) \to (0, \infty)$
be a proper closed convex function. Let $P$ and $Q$ be two probability measures defined on the same measurable space
$(\mathcal{A}, \mathcal{B})$ such that $Q$ is absolutely continuous with respect to $P$. Denote by $dQ/dP$ the
corresponding Radon-Nikodym density. The $\varphi$-divergence between $Q$ and $P$ is defined by:

$$D_\varphi(Q, P) = \int \varphi\left(\frac{dQ}{dP}(y)\right) dP(y). \qquad (1)$$

If $Q$ is not absolutely continuous with respect to $P$, we set $D_\varphi(Q, P) = \infty$. For the class of
Cressie-Read functions $\varphi_\gamma(t) = \frac{t^\gamma - \gamma t + \gamma - 1}{\gamma(\gamma - 1)}$, we get the
power divergences, which contain the Hellinger ($\gamma = 0.5$), the Pearson $\chi^2$ ($\gamma = 2$), the Neyman
$\chi^2$ ($\gamma = -1$) and other classical divergences.
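As an illustration of definition (1), the following is a minimal sketch for two discrete distributions, assuming the usual Cressie-Read generator $\varphi_\gamma(t) = (t^\gamma - \gamma t + \gamma - 1)/(\gamma(\gamma-1))$. The function names `cressie_read` and `phi_divergence` and the probability vectors are hypothetical and chosen only for the example.

```python
import numpy as np

def cressie_read(t, gamma):
    """Cressie-Read generator phi_gamma(t), for gamma not in {0, 1}."""
    t = np.asarray(t, dtype=float)
    return (t**gamma - gamma * t + gamma - 1.0) / (gamma * (gamma - 1.0))

def phi_divergence(q, p, gamma):
    """D_phi(Q, P) = sum_y phi(q(y) / p(y)) p(y) for discrete Q and P."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    if np.any((p == 0) & (q > 0)):
        return np.inf  # Q is not absolutely continuous with respect to P
    support = p > 0
    ratio = q[support] / p[support]
    return float(np.sum(cressie_read(ratio, gamma) * p[support]))

# Illustrative (made-up) probability vectors on three points.
q = np.array([0.2, 0.5, 0.3])
p = np.array([0.3, 0.4, 0.3])

pearson = phi_divergence(q, p, gamma=2.0)     # equals 0.5 * sum (q - p)^2 / p
hellinger = phi_divergence(q, p, gamma=0.5)   # equals 2 * sum (sqrt(q) - sqrt(p))^2
neyman = phi_divergence(q, p, gamma=-1.0)     # equals 0.5 * sum (q - p)^2 / q
```

With this normalisation the three values are the classical Pearson, Hellinger and Neyman quantities up to the constant factors given in the comments.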

When working with discrete models, $\varphi$-divergences are simply approximated using the empirical distribution
$P_n$, since both the model and the empirical distribution are absolutely continuous with respect to the Dirac measure.
Efficient and robust estimators were derived and extensively studied; see for example Simpson [1987] and Lindsay
[1994].

For continuous models, the empirical distribution can no longer directly replace the true distribution, since
the model has a continuous support. The model cannot be absolutely continuous with respect to $P_n$, and no
interesting estimation procedure is produced (see Broniatowski and Vajda [2012] for a proper explanation). Authors
such as Beran [1977] proposed to simply smooth the empirical distribution using kernels. Basu and Lindsay [1994]
proposed to smooth both the model and the empirical distribution in order to avoid the consistency conditions and
rates of convergence imposed on the kernel estimator (provided that a transparent kernel exists). In some basic
examples, their method can be reread as the calculation of the $\varphi$-divergence between a kernel estimator and a
weighted version of the model, equal to $p_\phi$ up to a constant factor. Although the smoothed model may result in a
loss of information, Basu and Lindsay show that this loss is rather small. They also admit that the choice of the
window and of the kernel remains difficult, since providing transparent kernels for a given model is a hard task^1.

Recently, an approach based on convexity arguments has been proposed by Liese and Vajda [2006] and Broniatowski
and Keziou [2006]. In both articles, the authors provide similar "supremal" representations of $\varphi$-divergences
in which a simple plug-in of the empirical distribution is possible without any smoothing technique. The resulting
estimators were called minimum dual $\varphi$-divergence estimators (MD$\varphi$DE). Since their appearance, no
complete study of the robustness of such estimators has been proposed, except for the calculation of the influence
function in Toma and Broniatowski [2011]. There were no simulation studies either, except for the paper of Frydlova
et al. [2012]. However, the authors considered only the normal model, for which the MD$\varphi$DE is proved to
coincide with the maximum likelihood estimator; see Broniatowski [2014]. Although they obtain a robust estimator in
one case^2, we believe that this was due to a calculation error which we explain later.

The dual representation proposed by both Liese and Vajda [2006] and Broniatowski and Keziou [2006] performs well
under the model. The resulting estimator even coincides with the MLE in full exponential families, and hence has the
same efficiency as the MLE. Weak and strong consistency are reached under classical conditions (see Broniatowski and
Keziou [2009]). The limit laws of the MD$\varphi$DE and of the estimated divergence are simple and were exploited to
build statistical tests. However, when we are not under the model, this approach is inconvenient and suffers from a
lack of robustness. In contamination models or under misspecification, it does not approximate well the
$\varphi$-divergence between the empirical distribution and the model; it even markedly underestimates its value. We
propose in this paper a brief explanation of this problem and provide a general solution. We also give two particular
solutions. The first is based on kernels and avoids the supremal form (hence no double optimization). The second is
devoted to contamination models and appears as a slight modification of the classical MD$\varphi$DE. We study the
consistency and the robustness, from an influence function point of view, of the kernel-based estimator and check
the corresponding conditions on simple examples.

In a second part of this paper, we briefly recall the Basu-Lindsay approach (Basu and Lindsay [1994]) and discuss
what happens in the context of densities defined on $(0, \infty)$. We show that symmetric kernels are not suitable and
provide some solutions through asymmetric kernels. We also discuss some of the advantages and drawbacks of the
so-called dual $\varphi$-divergence estimator (Broniatowski and Keziou [2009]), which is another estimator derived
through the dual representation of the divergence. The sensitivity to the choice of the escort parameter is discussed.
We compare this estimator with the density power divergence of Basu et al. [1998] and show a strong relation between
these two methods.

Finally, we provide several simulation results in a simple Gaussian model, a mixture of two Gaussian components, a
generalized Pareto distribution (GPD) and a mixture of two Weibull components. We make a comparison with the
classical MD$\varphi$DE, the MLE (calculated using EM for the mixture case), the Basu-Lindsay approach, the so-called
dual $\varphi$-divergence estimator, Beran's approach and the minimum power density divergence (MPD), both when the
data are drawn under the model and when they are contaminated by 10% of observations from other distributions. Our
new estimator is as efficient as the MLE under the model and is robust against outliers. Although it is not the best
robust estimator in simple examples (but close enough), it shows promising performance in difficult ones, where it
outperforms the other methods.

The paper is organized as follows. In Section 1, we give a theoretical introduction to the dual representation
of $\varphi$-divergences. We explain the problem of the existing approach based on duality and introduce a solution to
robustify the classical MD$\varphi$DE. Section 2 is devoted to the asymptotic properties of our kernel-based
MD$\varphi$DE, where a set of conditions ensuring consistency is given and verified on a Gaussian model. The influence
function is also calculated and proved to be bounded on a simple example. In Section 3, we recall the Basu-Lindsay
approach and see how it is applied when one uses asymmetric kernels. In Section 4, we discuss some of the advantages
of our new estimator in comparison with the classical MD$\varphi$DE and the Basu-Lindsay approach. We also list some
of its drawbacks. The so-called dual $\varphi$-divergence estimator is discussed in Section 5, where we show its close
relation with the density power divergence introduced in Basu et al. [1998]. In Section 6, we give another estimator
for contamination models and briefly discuss its properties. Finally, Section 7 contains an extensive simulation study
and a comparative discussion with the other estimators presented in the paper.

^1 The authors provide, however, three standard examples which admit transparent kernels: the Gaussian, the gamma
and the Poisson models.

^2 They get robust results when adding outliers drawn from a Cauchy distribution.

1 The dual representation of divergences

1.1 The theoretical approaches 

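Liese and Vajda [2006] propose the following "supremal" representation of $\varphi$-divergences. Let $\mathcal{P}$ be
a class of mutually absolutely continuous distributions such that for any triplet $P$, $P_T$ and $Q$,
$\varphi'(dP/dQ)$ is $P$-integrable. Theorem 17 in Liese and Vajda [2006] states that:

$$D_\varphi(P, P_T) = \sup_{Q \in \mathcal{P}} \left\{ \int \varphi'\left(\frac{dP}{dQ}\right) dP + \int \varphi\left(\frac{dP}{dQ}\right) dP_T - \int \frac{dP}{dQ}\, \varphi'\left(\frac{dP}{dQ}\right) dP_T \right\}, \qquad (2)$$

and the supremum is attained when $Q = P_T$. Broniatowski and Keziou [2006] have also developed a similar and more
general representation of $D_\varphi(P, P_T)$. Let $\mathcal{F}$ be some class of $\mathcal{B}$-measurable real-valued
functions. Let $\mathcal{M}_{\mathcal{F}}$ be the subspace of the space of probability measures $\mathcal{M}$ defined by
$\mathcal{M}_{\mathcal{F}} = \{P \in \mathcal{M} : \int |f|\, dP < \infty,\ \forall f \in \mathcal{F}\}$. Assume that
$\varphi$ is differentiable and strictly convex. Then, for all $P \in \mathcal{M}_{\mathcal{F}}$ such that
$D_\varphi(P, P_T)$ is finite and $\varphi'(dP/dP_T)$ belongs to $\mathcal{F}$, the $\varphi$-divergence admits the
dual representation (see Theorem 4.4 in Broniatowski and Keziou [2006]):

$$D_\varphi(P, P_T) = \sup_{f \in \mathcal{F}} \left\{ \int f\, dP - \int \varphi^*(f)\, dP_T \right\}, \qquad (3)$$

where $\varphi^*(x) = \sup_{t \in \mathbb{R}} \{tx - \varphi(t)\}$ is the Fenchel-Legendre convex conjugate. Moreover,
the supremum is attained at $f = \varphi'(dP/dP_T)$.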
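The link between the conjugate $\varphi^*$ and the generator, namely $\varphi^*(\varphi'(t)) = t\varphi'(t) - \varphi(t)$, can be checked numerically. The sketch below does so for the Pearson case $\gamma = 2$ of the Cressie-Read class, approximating the supremum in the definition of $\varphi^*$ by a maximum over a finite grid. The grid bounds and the helper name `conjugate` are assumptions made for the illustration.

```python
import numpy as np

gamma = 2.0  # Pearson case: phi(t) = (t - 1)^2 / 2

def phi(t):
    return (t**gamma - gamma * t + gamma - 1.0) / (gamma * (gamma - 1.0))

def phi_prime(t):
    return (t ** (gamma - 1.0) - 1.0) / (gamma - 1.0)

def conjugate(x, grid):
    """Approximate phi*(x) = sup_t {t x - phi(t)} by a maximum over a grid of t values."""
    return np.max(grid * x - phi(grid))

t_grid = np.linspace(0.0, 10.0, 100001)      # grid step 1e-4 (assumed fine enough)
s_values = np.array([0.5, 1.0, 2.0, 3.5])    # points where the identity is checked

lhs = np.array([conjugate(phi_prime(s), t_grid) for s in s_values])
rhs = s_values * phi_prime(s_values) - phi(s_values)
max_gap = float(np.max(np.abs(lhs - rhs)))   # small, limited only by the grid step
```

For this generator both sides equal $(s^2 - 1)/2$, so the gap only reflects the discretisation of the supremum.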

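When substituting $\mathcal{F}$ by the class of functions $\{\varphi'(dP/dQ),\ Q \in \mathcal{P}\}$, and using the
property $\varphi^*(\varphi'(t)) = t\varphi'(t) - \varphi(t)$, we obtain the same representation given above in (2).
Both formulations (2) and (3) are interesting in their own right and in their proofs. The second formula gives us the
opportunity to produce many supremal forms for the $\varphi$-divergence. In a parametric setup where
$dP_\phi = p_\phi\, dx$ for $\phi \in \Phi \subset \mathbb{R}^d$ and the true distribution generating the data is a
member of the model, i.e. $P_T = P_{\phi^T}$ for some $\phi^T \in \Phi$, Broniatowski and Keziou [2006] propose to use
the class of functions $\mathcal{F}_\phi = \{\varphi'(p_\phi/p_\alpha),\ \alpha \in \Phi\}$. The dual representation of
$D_\varphi$ is now written as:

$$D_\varphi(p_\phi, p_{\phi^T}) = \sup_{\alpha \in \Phi} \left\{ \int \varphi'\left(\frac{p_\phi}{p_\alpha}\right)(x)\, p_\phi(x)\, dx - \int \left[ \frac{p_\phi}{p_\alpha}\, \varphi'\left(\frac{p_\phi}{p_\alpha}\right) - \varphi\left(\frac{p_\phi}{p_\alpha}\right) \right](y)\, p_{\phi^T}(y)\, dy \right\}. \qquad (4)$$

The idea behind this choice is that the supremum is attained when $\alpha = \phi^T$. Since $p_{\phi^T}$ is unknown, one
may think of replacing $p_{\phi^T}(y)\,dy$ by the empirical distribution. This seems very natural and does not cause
any problem of absolute continuity as in formula (1). We now get:

$$\hat{D}_\varphi(p_\phi, p_{\phi^T}) = \sup_{\alpha \in \Phi} \left\{ \int \varphi'\left(\frac{p_\phi}{p_\alpha}\right)(x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{p_\phi}{p_\alpha}\, \varphi'\left(\frac{p_\phi}{p_\alpha}\right) - \varphi\left(\frac{p_\phi}{p_\alpha}\right) \right](y_i) \right\}. \qquad (5)$$

This quantity is "nearly" the divergence between the empirical distribution and the model. Both Broniatowski and
Keziou [2006] and Liese and Vajda [2006] propose to estimate the set of parameters $\phi^T$ by:

$$\hat{\phi}_n = \arg\inf_{\phi \in \Phi} \hat{D}_\varphi(p_\phi, p_{\phi^T}). \qquad (6)$$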


Broniatowski and Keziou [2006] called this estimator the minimum dual $\varphi$-divergence estimator (MD$\varphi$DE);
they also studied its asymptotic properties and provided sufficient conditions for its consistency. They have also
built some test statistics based on it. Toma and Broniatowski [2011] and Broniatowski and Vajda [2012] have studied
the robustness of such an estimator from an influence function (IF) point of view. The IF is unfortunately unbounded
in general and does not even depend on $\varphi$ for the class of Cressie-Read functions $\varphi_\gamma$ presented in
the introduction. This fact alone is not sufficient to conclude that the MD$\varphi$DE is not robust. It was pointed
out by many authors in the context of $\varphi$-divergences that an estimator may have an unbounded influence function
and still enjoy good robustness against outliers; see Beran [1977] for the Hellinger divergence in continuous models
and Lindsay [1994] for a general class of $\varphi$-divergences in discrete models.
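The double optimization in (5) and (6) can be sketched on a toy case. The code below assumes a unit-variance Gaussian location model $\mathcal{N}(\phi, 1)$ and the Hellinger member $\gamma = 0.5$ of the Cressie-Read class, for which $\varphi'(t) = (t^{\gamma-1} - 1)/(\gamma - 1)$ and $t\varphi'(t) - \varphi(t) = (t^\gamma - 1)/\gamma$, and for which the integral term has the closed form $(e^{\gamma(\gamma-1)d^2/2} - 1)/(\gamma - 1)$ with $d = \phi - \alpha$. Both optimizations are done by plain grid search, which is a simplification for readability and not the procedure used in the cited papers; all function names are hypothetical.

```python
import numpy as np

GAMMA = 0.5  # Hellinger member of the Cressie-Read class

def dual_criterion(phi, d, y):
    """Plug-in dual criterion inside the sup of (5) for N(phi, 1) against N(alpha, 1), alpha = phi - d.

    The log-ratio log(p_phi / p_alpha)(y) = d * (y - phi) + d**2 / 2 is used, so that
    no density value is ever divided by another one.
    """
    d = np.asarray(d, dtype=float)[:, None]             # column of offsets phi - alpha
    integral = (np.exp(GAMMA * (GAMMA - 1.0) * d**2 / 2.0) - 1.0) / (GAMMA - 1.0)
    log_ratio = d * (y[None, :] - phi) + d**2 / 2.0
    sample_term = np.mean((np.exp(GAMMA * log_ratio) - 1.0) / GAMMA, axis=1)
    return integral[:, 0] - sample_term                 # one value per offset d

def mdphide_location(y, phi_grid, d_grid):
    """Grid-search version of (6): arg inf over phi of the sup over alpha."""
    values = [np.max(dual_criterion(phi, d_grid, y)) for phi in phi_grid]
    return float(phi_grid[int(np.argmin(values))])

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=200)            # data drawn under the model
phi_grid = np.linspace(-1.0, 1.0, 201)                  # candidate values of phi
d_grid = np.linspace(-2.0, 2.0, 401)                    # candidate offsets, contains d = 0
estimate = mdphide_location(y, phi_grid, d_grid)
sample_mean = float(np.mean(y))                         # MLE of the location
```

On uncontaminated data the grid estimate lands next to the sample mean, which is in line with the coincidence between the MD$\varphi$DE and the MLE in exponential families stated in this paper.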

To this day, and to the best of our knowledge, there is not even a simulation study of the robustness of the
MD$\varphi$DE, although it is an estimator which, similarly to the density power divergence estimator of Basu et al.
[1998], does not require any smoothing or escort parameters. Besides, its asymptotic properties are proved under
merely classical conditions on the model. The only simulation study, to our knowledge, is that of Frydlova et al.
[2012], which focuses only on the normal model. In their results, the MD$\varphi$DE is comparable to the maximum
likelihood estimator when no contamination is present, while in some cases the MD$\varphi$DE appears robust under
contamination, although it should not, as we will see in the following paragraph.

1.2 Does the MD$\varphi$DE have any chance to be robust?

Equality with MLE in exponential families. An important aspect of the classical MD$\varphi$DE is that it coincides
with the maximum likelihood estimator in full exponential models whenever the corresponding true divergence
$D_\varphi$ is finite; see Broniatowski [2014]. This covers the standard Gaussian model, for which Frydlova et al.
[2012] reported clearly robust behaviour of the MD$\varphi$DE when outliers are generated by the standard Cauchy
distribution. This contradicts the theoretical result presented in Broniatowski [2014], which is an exact result and
depends only on analytic arguments. We have done similar simulations and found that numerical problems may play a
role here. Fortunately, no numerical integration is involved, since all integrals can be easily calculated; see
Frydlova et al. [2012] or Broniatowski and Vajda [2012]. When using the standard Cauchy distribution to generate
outliers, we get points with very large values, greater than 100. These points participate only in the sum term of
the MD$\varphi$DE (6). A Gaussian density with parameters not very far from the standard ones ($\mu = 0$,
$\sigma = 1$) evaluates to 0 at such points in floating-point arithmetic. Thus, numerical problems of the form 0/0
appear when calculating the sum term in (5), since the summand is of the form $g(p_\phi/p_\alpha)(y_i)$. If one uses
simple practical solutions to avoid this, such as adding a very small value (e.g. $10^{-100}$) to the denominator or
the numerator, a thresholding effect is produced and the true fraction is badly calculated. As a result, such
outliers have practically no effect on the procedure, as if they had not been added, and one obtains "forged robust
estimates". The same thresholding effect does not happen in the MLE, since the likelihood function does not contain
any fractions. On the other hand, if one calculates the fraction using the properties of the exponential function,
i.e. $p_\phi(y_i)/p_\alpha(y_i) = \exp[(y_i - \alpha)^2/2 - (y_i - \phi)^2/2]$, the MD$\varphi$DE defined by (6) gives
the same result as the maximum likelihood estimator and never better^3.
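The 0/0 effect described above is easy to reproduce. The following sketch compares, for a unit-variance Gaussian model, the ratio $p_\phi(y)/p_\alpha(y)$ computed from two density values with the same ratio computed through the closed-form exponent. The outlier value 150 and the parameter values are arbitrary choices for the illustration.

```python
import numpy as np

def gaussian_pdf(y, mean):
    """Density of N(mean, 1)."""
    return np.exp(-0.5 * (y - mean) ** 2) / np.sqrt(2.0 * np.pi)

def naive_ratio(y, phi, alpha):
    """p_phi(y) / p_alpha(y) computed from the two density values."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return gaussian_pdf(y, phi) / gaussian_pdf(y, alpha)

def stable_ratio(y, phi, alpha):
    """Same ratio through the exponent (y - alpha)^2 / 2 - (y - phi)^2 / 2."""
    return np.exp(0.5 * (y - alpha) ** 2 - 0.5 * (y - phi) ** 2)

y_outlier = np.array([150.0])                            # a Cauchy-type outlier
naive = naive_ratio(y_outlier, phi=0.0, alpha=0.1)       # both densities underflow to 0
stable = stable_ratio(y_outlier, phi=0.0, alpha=0.1)     # finite: exp(-14.995)
```

The naive version returns `nan` because both densities underflow to zero, whereas the exponent form keeps the actual (tiny but nonzero) contribution of the outlier.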

We have performed further simulations on several models which do not belong to the exponential family and found
that the MD$\varphi$DE behaves very similarly to the MLE; see Sect. 7 below. This should not be very surprising,
since a large class of probability laws can be approximated by exponential families. Papers such as Barron and Sheu
[1991] discuss how one can estimate a probability density using exponential families and prove interesting
convergence rates.


Why should it not work well although it is an estimator of a divergence which is proved to be a robust procedure
(see Donoho and Liu [1988])? We do not pretend to give a full answer about the non-robustness of the MD$\varphi$DE.
Our argument here is intuitive. When $P_T$ is a member of the model, the approximated dual formula converges to the
$\varphi$-divergence, and the argument of the infimum converges to the corresponding one, as the number of
observations increases. This consistency was discussed in Proposition 3.1 in Broniatowski and Keziou [2009]. Their
result, however, does not hold when $P_T$ is not a member of the model, i.e. under contamination or misspecification.
Indeed, consistency holds in the following sense:

$$\hat{D}_\varphi(P_\phi, P_T) \to \sup_{\alpha \in \Phi} \left\{ \int \varphi'\left(\frac{p_\phi}{p_\alpha}\right)(x)\, p_\phi(x)\, dx - \int \left[ \frac{p_\phi}{p_\alpha}\, \varphi'\left(\frac{p_\phi}{p_\alpha}\right) - \varphi\left(\frac{p_\phi}{p_\alpha}\right) \right](y)\, dP_T(y) \right\},$$

and the arginf of the left-hand side converges to the arginf of the right-hand side. The limiting quantity is the dual
representation of the $\varphi$-divergence, and since the supremum is attained uniquely when $p_\alpha = dP_T/dy$, it
is never attained as long as $P_T$ is not a member of the model. Moreover, the limiting quantity is a lower bound of
the divergence, and minimizing the former does not guarantee the minimization of the latter. Figure 1 illustrates
this idea on a standard Gaussian model where the mean is unknown. The true distribution is contaminated by a Gaussian
distribution $\mathcal{N}(\mu = 10, \sigma = 2)$. The minimum of the dual representation is attained at $\phi = 1$,
whereas it is attained at 0 for the true divergence. Figure (a) shows formula (4) and figure (b) shows formula (5).
The data contain 100 observations. We also represent the solution introduced in the following paragraph, which
overcomes this problem.

1.3 New reformulation of the dual representation 

As stated previously, when the data are contaminated, the supremum in (5) is not attained and the approximation
of the divergence between the model and the empirical distribution^4 is dramatically degraded. Since in the
non-approximated formula (4) or (2) the supremum is attained uniquely whenever $p_\alpha = p_T$, an intuitive idea is
to replace $p_\alpha$ by an adaptive (nonparametric) estimator of $p_T$ which is not restricted to belong to the
model. We then have a dual representation where the supremum is nearly attained whether we are under the model or
not. In this way, our criterion should inherit robustness properties against possible contamination, as it
approximates a $\varphi$-divergence.

One should be able to propose many solutions which correspond to this idea in order to reach a supremal attainment
in the dual representation, and which may vary depending on the situation. For example, if we face a proportion of
large-valued outliers, one may add an extra component to $p_\alpha$, i.e. replace $p_\alpha$ with the mixture
$\lambda p_\alpha + (1 - \lambda) q_\theta$. The extra component covers the outlier part in a smooth way. This
suggestion is still very specific and treats only the case of contaminated data. Any nonparametric estimator of $p_T$
can be used, and its parameters may be determined automatically in the supremum calculation (since the supremum will
be over the window parameter). In what follows, $K_{n,w}$ denotes a kernel estimator^5 of $p_T$ defined using a
symmetric or asymmetric kernel, with or without bias correction.

Figure 1: Underestimation caused by the classical dual representation compared to the new one. The true distribution
is taken to be $0.9\,\mathcal{N}(\mu = 0, \sigma = 1) + 0.1\,\mathcal{N}(\mu = 10, \sigma = 2)$. Figure (a) shows the
dual representation defined by (4) in comparison with the new reformulation defined by (7). Figure (b) shows the
corresponding approximations when we replace the true distribution by its empirical version.

^3 On the basis of 100 experiments, there were about 20 experiments where the MD$\varphi$DE suffered from numerical
complications and took values well above those of the MLE.

^4 Although this quantity is not well defined in the continuous case, the plug-in of the empirical distribution in
the dual representation gives an idea about the divergence between the model and the empirical distribution.

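In order to introduce our new MD$\varphi$DE, let us go back to the beginning. We restart from formula (3) and use the
following class of functions $\mathcal{F}_{\phi,n} = \{\varphi'(p_\phi/K_{n,w}),\ w > 0\}$. The dual representation is
now given by:

$$D_\varphi(P_\phi, P_T) = \sup_{w > 0} \left\{ \int \varphi'\left(\frac{p_\phi}{K_{n,w}}\right)(x)\, p_\phi(x)\, dx - \int \left[ \frac{p_\phi}{K_{n,w}}\, \varphi'\left(\frac{p_\phi}{K_{n,w}}\right) - \varphi\left(\frac{p_\phi}{K_{n,w}}\right) \right](y)\, p_T(y)\, dy \right\}. \qquad (7)$$

The supremum calculation produces a window $w_{opt}$ for which the kernel estimator $K_{n,w_{opt}}$ is the closest
(in some sense) to $p_T$. Now, we approximate $D_\varphi(P_\phi, P_T)$ by:

$$D_\varphi(P_\phi, P_T) \approx \int \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(x)\, p_\phi(x)\, dx - \int \left[ \frac{p_\phi}{K_{n,w_{opt}}}\, \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right) - \varphi\left(\frac{p_\phi}{K_{n,w_{opt}}}\right) \right](y)\, p_T(y)\, dy.$$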


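Since $p_T$ is the unknown object we hope to estimate, we replace it by its empirical version. Our final approximation
is given by:

$$\hat{D}_\varphi(P_\phi, P_T) = \int \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{p_\phi(y_i)}{K_{n,w_{opt}}(y_i)}\, \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(y_i) - \varphi\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(y_i) \right].$$

Define now the new minimum dual $\varphi$-divergence estimator by:

$$\hat{\phi}_n = \arg\inf_{\phi \in \Phi} \left\{ \int \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{p_\phi(y_i)}{K_{n,w_{opt}}(y_i)}\, \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(y_i) - \varphi\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(y_i) \right] \right\}. \qquad (8)$$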


An important question which arises now is: what should be the value of $w_{opt}$, since its calculation requires
knowing the true distribution? For the time being, we do not have any specific proposal for the choice of the window.
Since the window should be chosen so as to mimic the true distribution, one needs a good kernel estimator. In the
literature on kernel estimation, there exist many rules (automatic or not) to determine sub-optimal windows, such as
Silverman's or Scott's rule of thumb, cross-validation methods, etc. See for example Venables and Ripley [2013],
Chap. 5. Figure 1 shows, in a Gaussian example contaminated by a Gaussian component $\mathcal{N}(10, 2)$, the use of
Silverman's rule with a Gaussian kernel. The classical dual representation clearly underestimates the true
divergence, whereas the new reformulation stays close to it.

^5 In formula (7), the kernel function should not have a compact support, such as the Epanechnikov kernel, for the
sake of existence of the integral. This is only temporary: as we define the new estimator of the
$\varphi$-divergence, the integral is replaced by a Monte-Carlo average where the kernel is only calculated on
observed data, so that the integration problem disappears and the use of a compact-support kernel becomes possible.
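The kernel-based criterion minimised in (8) can be sketched as follows for the contaminated Gaussian example, under several simplifying assumptions that are not taken from the paper: a unit-variance location model $\mathcal{N}(\phi, 1)$, the Hellinger generator ($\gamma = 0.5$), a Gaussian kernel with Silverman's rule-of-thumb window in place of $w_{opt}$, a Riemann sum for the integral and a grid search over $\phi$. All function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, sd=1.0):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def silverman_bandwidth(y):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    q75, q25 = np.percentile(y, [75, 25])
    scale = min(np.std(y, ddof=1), (q75 - q25) / 1.34)
    return 0.9 * scale * len(y) ** (-0.2)

def kde(x, y, w):
    """Rosenblatt-Parzen estimator K_{n,w}(x) with a Gaussian kernel."""
    return np.mean(gaussian_pdf(x[:, None], y[None, :], w), axis=1)

def kernel_dual_criterion(phi, y, k_at_y, x_grid, k_at_grid):
    """Hellinger (gamma = 0.5) version of the criterion minimised in (8).

    With phi'(t) = 2 (1 - t^(-1/2)) and t phi'(t) - phi(t) = 2 (sqrt(t) - 1), it reads
    4 - 2 * int sqrt(K p_phi) dx - (2 / n) * sum_i sqrt(p_phi / K)(y_i).
    """
    dx = x_grid[1] - x_grid[0]
    integral = np.sum(np.sqrt(k_at_grid * gaussian_pdf(x_grid, phi))) * dx
    sample_term = np.mean(np.sqrt(gaussian_pdf(y, phi) / k_at_y))
    return 4.0 - 2.0 * integral - 2.0 * sample_term

rng = np.random.default_rng(1)
# 90% of N(0, 1) and 10% of N(10, 2), as in the contaminated example of the text.
y = np.concatenate([rng.normal(0.0, 1.0, 180), rng.normal(10.0, 2.0, 20)])
w = silverman_bandwidth(y)
x_grid = np.linspace(-10.0, 25.0, 3501)                  # integration grid (assumed wide enough)
k_at_y, k_at_grid = kde(y, y, w), kde(x_grid, y, w)
phi_grid = np.linspace(-2.0, 12.0, 281)                  # candidate locations
values = [kernel_dual_criterion(p, y, k_at_y, x_grid, k_at_grid) for p in phi_grid]
estimate = float(phi_grid[int(np.argmin(values))])
sample_mean = float(np.mean(y))                          # MLE of the location, pulled by outliers
```

In this toy setting the kernel-based estimate is expected to stay close to 0 while the sample mean is dragged toward the contaminating component; a grid search is used only to avoid local minima near the outliers.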


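Remark 1 The new MD$\varphi$DE keeps the MLE as a member of its class for the choice $\varphi(t) = -\log(t) + t - 1$.
Indeed, $\varphi'(t) = -1/t + 1$ and $t\varphi'(t) - \varphi(t) = \log(t)$. Thus:

$$\int \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{p_\phi(y_i)}{K_{n,w_{opt}}(y_i)}\, \varphi'\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(y_i) - \varphi\left(\frac{p_\phi}{K_{n,w_{opt}}}\right)(y_i) \right] = -\frac{1}{n} \sum_{i=1}^{n} \left[ \log(p_\phi(y_i)) - \log(K_{n,w_{opt}}(y_i)) \right],$$

where the integral term $\int (1 - K_{n,w_{opt}}/p_\phi)(x)\, p_\phi(x)\, dx = 1 - \int K_{n,w_{opt}}(x)\, dx$ vanishes
as soon as the kernel estimator integrates to one on the support of the model, and thus:

$$\hat{\phi}_n = \arg\inf_{\phi \in \Phi} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log(p_\phi(y_i)) + \frac{1}{n} \sum_{i=1}^{n} \log(K_{n,w_{opt}}(y_i)) \right\} = \arg\sup_{\phi \in \Phi} \frac{1}{n} \sum_{i=1}^{n} \log(p_\phi(y_i)) = \text{MLE}.$$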

Remark 2 Replacing $p_T$ by the empirical distribution in (7) should not be used as a way to calculate an automatic
window for the kernel. Indeed, although the proof of the dual representation supposes mutual absolute continuity
between $p_\alpha$, $p_\phi$ and $p_T$, one still expects that the attainment condition of the supremum
($p_\alpha = p_T$) should hold as we replace $p_T$ by the empirical distribution. Indeed, if we directly insert
$K_{n,w}$ in (5) instead of $p_\alpha$, the maximization is carried out over the window $w$, and the supremum will
always be attained for $w = 0$. When the kernel estimator is calculated by convolution, recall that
$K_{n,w} = K_w * P_n \to P_n$ as $w$ goes to zero.


2 Asymptotic properties and robustness of the new reformulation 

We present in this section some of the asymptotic properties of the new MD$\varphi$DE defined by (8). We use Theorem
5.7 from the book of Van Der Vaart [1998], which we restate here. Consistency of the kernel-based MD$\varphi$DE means
that $\hat{\phi}_n$ defined by (8) converges in probability to the true vector of parameters $\phi^T$ when we are
under the model, i.e. $P_T = P_{\phi^T}$. If we are not under the model, consistency is understood with respect to the
projection of $P_T$ on the model in the sense of the divergence. In other terms, the projection $P_{\phi^T}$ is the
member of the model $P_\phi$ whose parameters are defined by $\phi^T = \arg\inf_{\phi} D_\varphi(P_\phi, P_T)$.

Similarly to Basu and Lindsay [1994], there are some (rare) cases in which consistency of the kernel-based
MD$\varphi$DE does not need any condition on the kernel window. Thus, one may find simpler versions of the results
we give below. We will see that a Gaussian model with unknown mean is one of these examples, for which we give the
corresponding conditions.

In a second part of this section, we calculate the influence function of the kernel-based MD$\varphi$DE for a given
window, and show how the use of a kernel estimate instead of the model $p_\alpha$ in the dual formula makes the IF
bounded.

We use the same notation as in Van Der Vaart [1998] for integration. Thus, if $f$ is a $P$-integrable function, we
denote by $Pf$ the integral $\int f\, dP$. Moreover, the notation $K_w * P$ denotes the operation of smoothing $dP$ by
the kernel $K$ with bandwidth $w$. This smoothing can be done by simple convolution, as in the case of the
Rosenblatt-Parzen kernel estimator. Other kinds of smoothing are presented in Section 3. The smoothing is supposed to
be an additive operator on distributions in the sense that $K_w * (P \pm Q) = K_w * P \pm K_w * Q$.
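For the convolution-type (Rosenblatt-Parzen) smoothing, the additivity property can be illustrated on empirical measures. The sketch below assumes a Gaussian kernel and two samples of equal size, so that the empirical measure of the pooled sample is $(P_n + Q_n)/2$. The bandwidth, sample sizes and the function name `smooth` are arbitrary choices for the illustration.

```python
import numpy as np

def smooth(x, sample, w):
    """(K_w * P_n)(x): Gaussian kernel of bandwidth w convolved with the empirical measure."""
    z = (x[:, None] - sample[None, :]) / w
    return np.mean(np.exp(-0.5 * z**2), axis=1) / (w * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
sample_p = rng.normal(0.0, 1.0, 50)
sample_q = rng.normal(3.0, 1.0, 50)
x = np.linspace(-4.0, 7.0, 200)
w = 0.4

# Smoothing the pooled empirical measure versus averaging the two smoothed measures.
pooled = smooth(x, np.concatenate([sample_p, sample_q]), w)
averaged = 0.5 * (smooth(x, sample_p, w) + smooth(x, sample_q, w))
additivity_gap = float(np.max(np.abs(pooled - averaged)))   # zero up to rounding error
```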

2.1 Consistency 

Theorem 5.7 from Van Der Vaart [1998] allows us to treat the consistency of a general class of M-estimates. It is
stated as follows:

Theorem 3 Let $M_n$ be random functions and let $M$ be a fixed function of $\phi$ such that for every $\varepsilon > 0$

$$\sup_{\phi \in \Phi} |M_n(\phi) - M(\phi)| \xrightarrow{\ \mathbb{P}\ } 0, \qquad (9)$$

$$\inf_{\phi : \|\phi - \phi^T\| \geq \varepsilon} M(\phi) > M(\phi^T). \qquad (10)$$

Then any sequence of estimators $\hat{\phi}_n$ with $M_n(\hat{\phi}_n) \leq M_n(\phi^T) + o_P(1)$ converges in
probability to $\phi^T$.

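In our approach, the function $M_n$ corresponds to the criterion function $P_n H(P_n, \phi)$, where $H(P_n, \phi, y)$
is defined by:

$$H(P_n, \phi, y) = \int \varphi'\left(\frac{p_\phi}{K_w * P_n}\right)(x)\, p_\phi(x)\, dx - \left[ \frac{p_\phi(y)}{K_w * P_n(y)}\, \varphi'\left(\frac{p_\phi}{K_w * P_n}\right)(y) - \varphi\left(\frac{p_\phi}{K_w * P_n}\right)(y) \right].$$

The function $M$ is simply defined as the expected^6 limit in probability of $M_n$, since the Law of Large Numbers
cannot be used because the average term is not a sum of i.i.d. random variables. It is given by
$P_{\phi^T} h(P_{\phi^T}, \phi)$, where $h(P_{\phi^T}, \phi, y)$ is defined as:

$$h(P_{\phi^T}, \phi, y) = \int \varphi'\left(\frac{p_\phi}{p_{\phi^T}}\right)(x)\, p_\phi(x)\, dx - \left[ \frac{p_\phi(y)}{p_{\phi^T}(y)}\, \varphi'\left(\frac{p_\phi}{p_{\phi^T}}\right)(y) - \varphi\left(\frac{p_\phi}{p_{\phi^T}}\right)(y) \right].$$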

In order to prove (9), we propose to divide the argument into two parts. One can write:

$$\sup_{\phi \in \Phi} |P_n H(P_n, \phi) - P_{\phi^T} h(P_{\phi^T}, \phi)| \leq \sup_{\phi \in \Phi} |P_n H(P_n, \phi) - P_n h(P_{\phi^T}, \phi)| + \sup_{\phi \in \Phi} |P_{\phi^T} h(P_{\phi^T}, \phi) - P_n h(P_{\phi^T}, \phi)|. \qquad (11)$$

Now, the second supremum tends to 0 in probability by the Glivenko-Cantelli theorem as soon as the function
$\phi \mapsto h(P_{\phi^T}, \phi)$ is $P_{\phi^T}$-integrable, or more generally if
$\{h(P_{\phi^T}, \phi),\ \phi \in \Phi\}$ is a Glivenko-Cantelli class of functions. The problem then resides in
finding conditions under which the first supremum tends to 0 in probability. The remainder of this paragraph is
concerned with the search for such conditions. In the whole section concerning the consistency of our new estimator,
the window parameter $w$ is supposed to depend directly on $n$, in order to be able to use Theorem 3 without any
modification. Besides, the construction of the estimator from (7) shows the explicit link of the window with $n$.

The following results are arranged so as to give first the most general case one can offer. This result shows,
according to our proof, that it is very difficult to derive a result that is both general and applicable. The ideas
we provide are nevertheless useful to derive particular results for a given divergence. We then treat the case of
divergences based on the Cressie-Read class of functions. The conditions of our result for this case still seem very
restrictive. We finally discuss two particular cases, when $\gamma$ is either in the interval $(0, 1)$ or in
$(-1, 0)$. Simpler conditions are derived and then verified in the Gaussian model when we use a Gaussian kernel.


2.2 General Result 


We derive in this paragraph a result which concerns the general class of divergence functions $\varphi$. Afterwards,
simpler conditions will be proved for the particular class of Cressie-Read functions $\varphi_\gamma$. Let
$\varphi^\#$ be the function $\varphi^\#(t) = t\varphi'(t) - \varphi(t)$. We then have:

$$P_n H(P_n, \phi) - P_n h(P_{\phi^T}, \phi) = \int \left[ \varphi'\left(\frac{p_\phi}{K_w * P_n}\right) - \varphi'\left(\frac{p_\phi}{p_{\phi^T}}\right) \right](x)\, p_\phi(x)\, dx - \frac{1}{n} \sum_{i=1}^{n} \left[ \varphi^\#\left(\frac{p_\phi}{K_w * P_n}\right)(y_i) - \varphi^\#\left(\frac{p_\phi}{p_{\phi^T}}\right)(y_i) \right].$$

The key idea is to treat each term (the integral and the sum) separately and prove its uniform convergence in
probability towards 0. Another important step is to apply the mean value theorem in order to transfer the difference
from the functions $\varphi'$ and $\varphi^\#$ into a difference between the kernel estimator and the true
distribution, where the consistency of the former is exploited. We now state our general result:

Theorem 4 Assume that:

1. the function $t \mapsto \varphi(t)$ is twice differentiable;

2. the kernel estimator is strongly consistent, i.e. $\sup_x |K_w * P_n(x) - p_{\phi^T}(x)| \to 0$ in probability;

3. the function $x \mapsto \varphi^\#\left(\frac{p_\phi}{p_{\phi^T}}\right)(x)$ is $P_T$-integrable for any $\phi$ in $\Phi$;

4. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$A_n = \sup_{\phi \in \Phi} \int \frac{p_\phi^2(x)}{p_{\phi^T}(x)\, K_w * P_n(x)}\, \varphi''\left( \lambda_1(x) \frac{p_\phi}{K_w * P_n}(x) + (1 - \lambda_1(x)) \frac{p_\phi}{p_{\phi^T}}(x) \right) dx$$

is upper bounded independently of $n$, where $\lambda_1(x) \in (0, 1)$, is greater than $1 - \eta_n$ for $\eta_n \to 0$;

5. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$B_n = \sup_{\phi \in \Phi} \frac{1}{n} \sum_{i=1}^{n} \frac{p_\phi(y_i)}{p_{\phi^T}(y_i)\, K_w * P_n(y_i)}\, (\varphi^\#)'\left( \lambda_2(y_i) \frac{p_\phi}{K_w * P_n}(y_i) + (1 - \lambda_2(y_i)) \frac{p_\phi}{p_{\phi^T}}(y_i) \right)$$

is upper bounded, where $\lambda_2(y_i) \in (0, 1)$, is greater than $1 - \eta_n$ for $\eta_n \to 0$;

6. $\inf_{\phi : \|\phi - \phi^T\| > \varepsilon} P_T h(P_T, \phi) > P_T h(P_T, \phi^T)$;

then the minimum dual $\varphi$-divergence estimator defined by (8) is consistent whenever it exists.

^6 In the literal sense and not mathematically.

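Proof. Let $\varepsilon > 0$. We want to prove that
$\lim_{n \to \infty} \mathbb{P}\left( \sup_{\phi \in \Phi} |P_n H(P_n, \phi) - P_{\phi^T} h(P_{\phi^T}, \phi)| < \varepsilon \right) = 1$.
Since $\varphi$ is twice differentiable (which also implies the differentiability of $\varphi^\#$), by the mean value
theorem there exist two functions $\lambda_1, \lambda_2 : \mathbb{R} \to (0, 1)$ such that:

$$\varphi'\left(\frac{p_\phi}{K_w * P_n}\right) - \varphi'\left(\frac{p_\phi}{p_{\phi^T}}\right) = \varphi''\left( \lambda_1 \frac{p_\phi}{K_w * P_n} + (1 - \lambda_1) \frac{p_\phi}{p_{\phi^T}} \right) \left[ \frac{p_\phi}{K_w * P_n} - \frac{p_\phi}{p_{\phi^T}} \right],$$

$$\varphi^\#\left(\frac{p_\phi}{K_w * P_n}\right) - \varphi^\#\left(\frac{p_\phi}{p_{\phi^T}}\right) = (\varphi^\#)'\left( \lambda_2 \frac{p_\phi}{K_w * P_n} + (1 - \lambda_2) \frac{p_\phi}{p_{\phi^T}} \right) \left[ \frac{p_\phi}{K_w * P_n} - \frac{p_\phi}{p_{\phi^T}} \right].$$

Let $n$ be sufficiently large such that:

$$\sup_x |K_w * P_n(x) - p_{\phi^T}(x)| < \min\left( \frac{\varepsilon}{A_n}, \frac{\varepsilon}{B_n} \right),$$

where $A_n$ and $B_n$ are as described in points 4 and 5 of the theorem. Provided that the constants $A_n$ and $B_n$
exist and are bounded independently of $n$, this event occurs with probability $1 - \eta_n$ with $\eta_n \to 0$ by the
strong consistency assumption (point 2). This implies that both events:

$$\left| \int \left[ \varphi'\left(\frac{p_\phi}{K_w * P_n}\right) - \varphi'\left(\frac{p_\phi}{p_{\phi^T}}\right) \right] p_\phi\, dx \right| \leq \frac{\varepsilon}{A_n} \int \frac{p_\phi^2}{p_{\phi^T}\, K_w * P_n}\, \varphi''\left( \lambda_1(x) \frac{p_\phi}{K_w * P_n} + (1 - \lambda_1(x)) \frac{p_\phi}{p_{\phi^T}} \right) dx \leq \varepsilon,$$

$$\left| \frac{1}{n} \sum_{i=1}^{n} \left[ \varphi^\#\left(\frac{p_\phi}{K_w * P_n}\right)(y_i) - \varphi^\#\left(\frac{p_\phi}{p_{\phi^T}}\right)(y_i) \right] \right| \leq \frac{\varepsilon}{B_n}\, \frac{1}{n} \sum_{i=1}^{n} \frac{p_\phi}{p_{\phi^T}\, K_w * P_n}\, (\varphi^\#)'\left( \lambda_2(y_i) \frac{p_\phi}{K_w * P_n} + (1 - \lambda_2(y_i)) \frac{p_\phi}{p_{\phi^T}} \right) \leq \varepsilon$$

happen with probability greater than $1 - \eta_n$ independently of $\phi$. Finally, we conclude that

$$\mathbb{P}\left( \sup_{\phi \in \Phi} |P_n H(P_n, \phi) - P_n h(P_{\phi^T}, \phi)| < 2\varepsilon \right) \geq 1 - \eta_n,$$

and hence $\sup_{\phi \in \Phi} |P_n H(P_n, \phi) - P_n h(P_{\phi^T}, \phi)| \to 0$ in probability. To end the proof, we
use assumption 3 of the present theorem together with the Glivenko-Cantelli theorem to conclude that
$\sup_{\phi \in \Phi} |P_{\phi^T} h(P_{\phi^T}, \phi) - P_n h(P_{\phi^T}, \phi)| \to 0$ in probability. Using inequality
(11), we conclude that $\sup_{\phi \in \Phi} |P_n H(P_n, \phi) - P_{\phi^T} h(P_{\phi^T}, \phi)| \to 0$ in probability.
We end with the use of Theorem 3. Condition (9) is verified by the previous arguments, and Condition (10) is what we
have assumed in point 6 of the present theorem. By definition of the kernel-based MD$\varphi$DE as a minimum of the
criterion function $\phi \mapsto P_n H(P_n, \phi)$, Theorem 3 entails the consistency of our new estimator. ■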

This result is very general since the function $\varphi$ is only supposed to be twice differentiable. For consistency results one can consult for example Wied and Weißbach [2012], Zambom and Dias [2013] or Libengué Dobélé-Kpoka [2013] Chap. 1 for a brief survey on symmetric kernels. If one is using asymmetric kernels, consistency is unfortunately only proved on every compact subset of the support of the distribution function, see Bouezmarni and Scaillet [2005] or Libengué Dobélé-Kpoka [2013] Chap. 3 for a more general approach. On the other hand, it is not simple to verify conditions 4 and 5 for the general class of functions $\varphi$, and one may derive for one's own case study a simpler set of conditions on the basis of this result. For condition 4, if one is using for example the Pearson $\chi^2$ divergence, $\varphi''(t) = 1$ so that the function $\lambda_1$ disappears and the expression of $A_n$ is simplified. The main subtlety in condition 5 is that the sum is over strongly dependent random variables. We will see in the case of divergences with $\varphi = \varphi_\gamma$ for $\gamma \in (-1,0)$ that this sum becomes a sum over i.i.d. random variables only and is simple to handle.

Assumption 6 means that the function $\phi \mapsto P_T h(P_T,\phi)$ has a unique and well separated minimum. Uniqueness is already in our hands since the function $\phi \mapsto P_T h(P_T,\phi)$ is none other than the dual representation (with the supremum calculated) of the $\varphi$-divergence $D_\varphi(p_\phi, p_{\phi^T})$. Using the property that $D_\varphi(p_\phi, p_{\phi^T}) = 0$ iff $p_\phi = p_{\phi^T}$, uniqueness is immediately verified as long as the model is identifiable.

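^Recall that $\varphi$ should also verify other conditions related to the notion of $\varphi$-divergences as mentioned in the introduction.

2.3 General Result for Power Divergences

Power divergences are the divergences defined through the class of Cressie-Read functions defined by:

$$\varphi_\gamma(t) = \frac{t^\gamma - \gamma t + \gamma - 1}{\gamma(\gamma - 1)}.$$

The kernel-based MD$\varphi$DE is defined as:

$$\hat\phi_n = \arg\inf_{\phi} \left\{ \frac{1}{\gamma - 1}\int \frac{p_\phi^\gamma}{(K_w * P_n)^{\gamma-1}}(x)\,dx - \frac{1}{\gamma n}\sum_{i=1}^n \frac{p_\phi^\gamma(y_i)}{(K_w * P_n)^\gamma(y_i)} - \frac{1}{\gamma(\gamma-1)} \right\}.$$

The idea here is to get rid of $\varphi$ in many places and replace the derivatives $\varphi'$ and $(\varphi^{\#})'$ with more explicit formulas. The idea of using the mean value theorem is kept, but this time it is applied to simpler functions. Here we have:

$$P_n H(P_n,\phi) - P_n h(P_T,\phi) = \frac{1}{\gamma - 1}\int \frac{(K_w * P_n)^{-\gamma+1} - p_{\phi^T}^{-\gamma+1}}{p_\phi^{-\gamma}}(x)\,dx - \frac{1}{\gamma n}\sum_{i=1}^n \frac{(K_w * P_n)^{-\gamma} - p_{\phi^T}^{-\gamma}}{p_\phi^{-\gamma}}(y_i).$$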
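The following minimal Python sketch illustrates the Cressie-Read class used in this section. It is only an illustration under stated assumptions: the function names (`phi`, `phi_prime`, `phi_sharp`, `phi_sharp_closed`) are hypothetical and not part of the paper, and $\varphi^{\#}(t) = t\varphi'(t) - \varphi(t)$ is assumed as the definition of $\varphi^{\#}$. It checks numerically that, for power divergences, $\varphi_\gamma^{\#}(t) = (t^\gamma - 1)/\gamma$, which is what makes the explicit criterion possible.

```python
def phi(t, gamma):
    """Cressie-Read function phi_gamma(t), for gamma not in {0, 1}."""
    return (t ** gamma - gamma * t + gamma - 1.0) / (gamma * (gamma - 1.0))


def phi_prime(t, gamma):
    """First derivative of phi_gamma."""
    return (t ** (gamma - 1.0) - 1.0) / (gamma - 1.0)


def phi_sharp(t, gamma):
    """phi^#(t) = t * phi'(t) - phi(t) (assumed definition of phi^#)."""
    return t * phi_prime(t, gamma) - phi(t, gamma)


def phi_sharp_closed(t, gamma):
    """Closed form of phi^# for the Cressie-Read class."""
    return (t ** gamma - 1.0) / gamma


# Largest discrepancy between the definition and the closed form.
max_gap = max(
    abs(phi_sharp(t, g) - phi_sharp_closed(t, g))
    for g in (-0.5, 0.5, 2.0)
    for t in (0.3, 1.0, 2.5)
)
```

With $\gamma = 2$ the function reduces to $(t-1)^2/2$, the Pearson case in which the second derivative is constant.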


Theorem 5 For the class of power divergences defined through the class of Cressie-Read functions assume that:

1. the kernel estimator is strongly consistent, i.e. $\sup_x |K_w * P_n(x) - p_{\phi^T}(x)| \to 0$ in probability;

2. the function $x \mapsto \varphi^{\#}\left(\frac{p_\phi}{p_{\phi^T}}\right)(x)$ is $P_T$-integrable for any $\phi$ in $\Phi$;

3. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$A_n = \sup_\phi \int \frac{\left[\lambda_1(x) K_w * P_n(x) + (1-\lambda_1(x))\,p_{\phi^T}(x)\right]^{-\gamma}}{p_\phi^{-\gamma}(x)}\,dx$$

is upper bounded independently of $n$, where $\lambda_1(x) \in (0,1)$, is greater than $1 - \eta_n$ for $\eta_n \to 0$;

4. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$B_n = \sup_\phi \frac{1}{n}\sum_{i=1}^n \frac{\left[\lambda_2(y_i) K_w * P_n(y_i) + (1-\lambda_2(y_i))\,p_{\phi^T}(y_i)\right]^{-\gamma-1}}{p_\phi^{-\gamma}(y_i)}$$

is upper bounded independently of $n$, where $\lambda_2(y_i) \in (0,1)$, is greater than $1 - \eta_n$ for $\eta_n \to 0$;

5. $\inf_{\phi:\, \|\phi - \phi^T\| > \varepsilon} P_T h(P_T,\phi) > P_T h(P_T,\phi^T)$;

then the minimum dual $\varphi$-divergence estimator defined by (8) is consistent whenever it exists.

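Proof. Let $x$ and $\alpha \neq 0, 1$ be real numbers. By the mean value theorem, there exists $\lambda(x) \in (0,1)$ such that:

$$(K_w * P_n)^\alpha(x) - p_{\phi^T}^\alpha(x) = \alpha\left[\lambda(x) K_w * P_n(x) + (1-\lambda(x))\,p_{\phi^T}(x)\right]^{\alpha-1}\left(K_w * P_n(x) - p_{\phi^T}(x)\right).$$

This implies both identities:

$$\frac{(K_w * P_n)^{-\gamma+1}(x) - p_{\phi^T}^{-\gamma+1}(x)}{p_\phi^{-\gamma}(x)} = (1-\gamma)\frac{\left[\lambda_1(x) K_w * P_n(x) + (1-\lambda_1(x))\,p_{\phi^T}(x)\right]^{-\gamma}}{p_\phi^{-\gamma}(x)} \times \left(K_w * P_n(x) - p_{\phi^T}(x)\right),$$

$$\frac{(K_w * P_n)^{-\gamma}(y_i) - p_{\phi^T}^{-\gamma}(y_i)}{p_\phi^{-\gamma}(y_i)} = -\gamma\frac{\left[\lambda_2(y_i) K_w * P_n(y_i) + (1-\lambda_2(y_i))\,p_{\phi^T}(y_i)\right]^{-\gamma-1}}{p_\phi^{-\gamma}(y_i)} \times \left(K_w * P_n(y_i) - p_{\phi^T}(y_i)\right).$$

Let $n$ be sufficiently large such that:

$$\sup_x |K_w * P_n(x) - p_{\phi^T}(x)| < \min\left(\frac{\varepsilon}{A_n}, \frac{\varepsilon}{B_n}\right),$$

where $A_n$ and $B_n$ are as defined in the theorem. This event occurs with probability greater than $1 - \eta_n$ with $\eta_n \to 0$ by the strong consistency of the kernel. The remainder of the argument is the same as for Theorem 2. ■

Quantities $A_n$ and $B_n$ can be simplified according to the range of values of $\gamma$ in a way that $\lambda_1$ and $\lambda_2$ play no role in the calculation. These functions have no explicit formulas in general and the only information we have in hand is that they take their values in $(0,1)$. Indeed, if $\gamma > 0$, using the convexity of the function $t \mapsto t^{-\gamma}$ over $\mathbb{R}_+$ and the fact that both quantities $\lambda_1(x)$ and $1 - \lambda_1(x)$ are upper bounded by 1, we may write:

$$A_n \le \sup_\phi \int \frac{\left[K_w * P_n(x)\right]^{-\gamma}}{p_\phi^{-\gamma}(x)}\,dx + \sup_\phi \int \frac{p_{\phi^T}^{-\gamma}(x)}{p_\phi^{-\gamma}(x)}\,dx.$$

If $\gamma < 0$, one may use the increasing property of the function $t \mapsto t^{-\gamma}$ on $\mathbb{R}_+$ to deduce that:

$$A_n \le \sup_\phi \int \frac{\left[K_w * P_n(x) + p_{\phi^T}(x)\right]^{-\gamma}}{p_\phi^{-\gamma}(x)}\,dx.$$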

Moreover, for values of $\gamma$ in $(-1,0)$, we may use Jensen’s inequality to go further and write:

An < sup 


Knj* Pn{x) +P^t{x) ^ 


P4>ix) 

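Similar upper bounds can be established for $B_n$. When $\gamma > -1$, $\gamma \neq 0, 1$, we use again the convexity of the function $t \mapsto t^{-\gamma-1}$ to get the following upper bound:

$$B_n \le \sup_\phi \frac{1}{n}\sum_{i=1}^n \frac{(K_w * P_n)^{-\gamma-1}(y_i) + p_{\phi^T}^{-\gamma-1}(y_i)}{p_\phi^{-\gamma}(y_i)}.$$

Finally, when $\gamma < -1$, we use the increasing property of the function $t \mapsto t^{-\gamma-1}$ over $\mathbb{R}_+$. We get:

$$B_n \le \sup_\phi \frac{1}{n}\sum_{i=1}^n \frac{\left[K_w * P_n(y_i) + p_{\phi^T}(y_i)\right]^{-\gamma-1}}{p_\phi^{-\gamma}(y_i)}.$$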

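2.4 Case of power divergences with $\gamma \in (0,1)$

In this paragraph, we try to derive simpler conditions than those in Theorems 2 and 3. For the class of Cressie-Read functions $\varphi_\gamma$ with $\gamma \in (0,1)$, the kernel-based MD$\varphi$DE has the form:

$$\hat\phi_n = \arg\inf_\phi \left\{ \frac{1}{\gamma-1}\int (K_w * P_n)^{1-\gamma}(x)\,p_\phi^\gamma(x)\,dx - \frac{1}{\gamma n}\sum_{i=1}^n \frac{p_\phi^\gamma(y_i)}{(K_w * P_n)^\gamma(y_i)} - \frac{1}{\gamma(\gamma-1)} \right\}. \qquad (12)$$

Our main problem is always the study of the difference $P_n H(P_n,\phi) - P_n h(P_T,\phi)$ which is given by:

$$P_n H(P_n,\phi) - P_n h(P_T,\phi) = \frac{1}{\gamma-1}\int \left[(K_w * P_n)^{1-\gamma} - p_{\phi^T}^{1-\gamma}\right](x)\,p_\phi^\gamma(x)\,dx - \frac{1}{\gamma n}\sum_{i=1}^n \frac{p_\phi^\gamma(y_i)\left(p_{\phi^T}^\gamma(y_i) - (K_w * P_n)^\gamma(y_i)\right)}{(K_w * P_n)^\gamma(y_i)\; p_{\phi^T}^\gamma(y_i)}.$$

The key idea here is to use the uniform continuity of both functions $t \mapsto t^\gamma$ and $t \mapsto t^{1-\gamma}$.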
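To make the criterion (12) concrete, here is a rough pure-Python sketch for a Gaussian location model $\mathcal{N}(\mu, 1)$ with a Gaussian kernel. Everything in it is an assumption made for illustration: the helper names (`normal_pdf`, `kde`, `criterion`, `estimate_mean`), the trapezoid rule for the integral, the grid search over candidate values of $\mu$, and the particular choices $n = 200$, $w = 0.4$, $\gamma = 0.5$. It is not the implementation used for the simulations of the paper.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kde(x, sample, w):
    """Gaussian Parzen-Rosenblatt estimate K_w * P_n evaluated at x."""
    return sum(normal_pdf(x, yi, w) for yi in sample) / len(sample)


def criterion(mu, sample, kde_grid, kde_sample, grid, gamma):
    """Criterion (12) for the model N(mu, 1); integral by the trapezoid rule."""
    step = grid[1] - grid[0]
    vals = [normal_pdf(x, mu) ** gamma * k ** (1.0 - gamma)
            for x, k in zip(grid, kde_grid)]
    integral = step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    mean_term = sum((normal_pdf(y, mu) / k) ** gamma
                    for y, k in zip(sample, kde_sample)) / len(sample)
    return (integral / (gamma - 1.0) - mean_term / gamma
            - 1.0 / (gamma * (gamma - 1.0)))


def estimate_mean(sample, w, gamma, candidates):
    """Grid-search minimiser of the criterion over candidate means."""
    lo, hi = min(sample) - 6.0, max(sample) + 6.0
    grid = [lo + i * (hi - lo) / 600 for i in range(601)]
    kde_grid = [kde(x, sample, w) for x in grid]      # kernel estimate on the grid
    kde_sample = [kde(y, sample, w) for y in sample]  # kernel estimate at the data
    return min(candidates,
               key=lambda m: criterion(m, sample, kde_grid, kde_sample, grid, gamma))


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]
candidates = [-2.0 + 0.05 * i for i in range(81)]
mu_hat = estimate_mean(sample, w=0.4, gamma=0.5, candidates=candidates)
```

The kernel estimate is computed once and reused for every candidate $\mu$, since it does not depend on the parameter; only the model density changes inside the criterion.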

Theorem 6 For power divergences defined by $\varphi_\gamma$ with $\gamma \in (0,1)$, suppose that:

1. the kernel estimator is strongly consistent, i.e. $\sup_x |K_w * P_n(x) - p_{\phi^T}(x)| \to 0$ in probability;

2. the function $x \mapsto \varphi^{\#}\left(\frac{p_\phi}{p_{\phi^T}}\right)(x)$ is $P_T$-integrable for any $\phi$ in $\Phi$;

3. the quantity $A = \sup_\phi \int p_\phi^\gamma(x)\,dx$ is upper bounded;

4. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$B_n = \sup_\phi \frac{1}{n}\sum_{i=1}^n \left(\frac{p_\phi}{p_{\phi^T}\, K_w * P_n}\right)^\gamma(y_i)$$

is upper bounded independently of $n$ is greater than $1 - \eta_n$ for $\eta_n \to 0$;

5. $\inf_{\phi:\, \|\phi - \phi^T\| > \varepsilon} P_T h(P_T,\phi) > P_T h(P_T,\phi^T)$;

then the minimum dual $\varphi$-divergence estimator defined by (12) is consistent whenever it exists.

Proof. In order to prove consistency of the kernel-based MD$\varphi$DE, we follow the same steps as in Theorem 2. To verify condition (9), we use the decomposition (11). The second term in (11) goes to zero in probability using assumption 2 of the present theorem and the Glivenko-Cantelli theorem. In order to treat the first term, we use the uniform continuity of both functions $t \mapsto t^\gamma$ and $t \mapsto t^{1-\gamma}$. If

$$|p_{\phi^T}(x) - K_w * P_n(x)| < \delta_1(\varepsilon),$$

then

$$\left|(K_w * P_n)^\gamma(x) - p_{\phi^T}^\gamma(x)\right| < \frac{\gamma\varepsilon}{B_n}, \qquad (13)$$

where $B_n$ is as given above. On the other hand, uniform continuity of $t \mapsto t^{1-\gamma}$ entails that if

$$|p_{\phi^T}(x) - K_w * P_n(x)| < \delta_2(\varepsilon),$$

then

$$\left|(K_w * P_n)^{1-\gamma}(x) - p_{\phi^T}^{1-\gamma}(x)\right| < \frac{\varepsilon(1-\gamma)}{\sup_\phi \int p_\phi^\gamma(x)\,dx}. \qquad (14)$$

Let $n$ be sufficiently large such that

$$\sup_x |p_{\phi^T}(x) - K_w * P_n(x)| < \min\left(\delta_1(\varepsilon), \delta_2(\varepsilon)\right).$$

By the strong consistency assumption for the kernel estimator, this event happens with probability greater than $1 - \eta_n$ where $\eta_n \to 0$. Thus, each of (13) and (14) occurs with probability greater than $1 - \eta_n$ where $\eta_n \to 0$.

The remainder of the argument is exactly the same as for Theorem 2. ■

This result is clearly far simpler than the one given in Theorem 2. Condition 3 here is already independent of $n$ and is deterministic, which is not the case for assumption 4 in Theorem 2 and assumption 3 in Theorem 3. Although assumption 4 does not contain unknown functions such as $\lambda_2$ defined for previous results, it still presents a similar difficulty since it concerns a sum of strongly dependent terms.


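2.5 Case of power divergences with $\gamma \in (-1,0)$

This time we will use the uniform continuity of the function $t \mapsto t^{-\gamma}$ to prove that if $|K_w * P_n(x) - p_{\phi^T}(x)| < \delta_2$, then:

$$\left|(K_w * P_n)^{-\gamma}(x) - p_{\phi^T}^{-\gamma}(x)\right| < \frac{\varepsilon}{\sup_\phi \frac{1}{n}\sum_{i=1}^n p_\phi^\gamma(y_i)}.$$

Thus, $B_n$ of Theorem 3 is now replaced by the quantity

$$B_n = \sup_\phi \frac{1}{n}\sum_{i=1}^n p_\phi^\gamma(y_i).$$

On the other hand, we rewrite the integral difference as follows:

$$\int \frac{(K_w * P_n)^{-\gamma+1}(x) - p_{\phi^T}^{-\gamma+1}(x)}{p_\phi^{-\gamma}(x)}\,dx = \int \frac{\left[(K_w * P_n)^{\frac{-\gamma+1}{2}}(x) - p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)\right]\left[(K_w * P_n)^{\frac{-\gamma+1}{2}}(x) + p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)\right]}{p_\phi^{-\gamma}(x)}\,dx.$$

Now, using the uniform continuity of the function $t \mapsto t^{\frac{-\gamma+1}{2}}$, we may deduce that if $|K_w * P_n(x) - p_{\phi^T}(x)| < \delta_1$, then:

$$\left|(K_w * P_n)^{\frac{-\gamma+1}{2}}(x) - p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)\right| < \frac{\varepsilon}{\sup_\phi \int \frac{(K_w * P_n)^{\frac{-\gamma+1}{2}}(x) + p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)}{p_\phi^{-\gamma}(x)}\,dx}.$$

Thus $A_n$ of Theorem 3 is now replaced by the simpler quantity:

$$A_n = \sup_\phi \int \frac{(K_w * P_n)^{\frac{-\gamma+1}{2}}(x) + p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)}{p_\phi^{-\gamma}(x)}\,dx.$$

The remainder of the argument follows similarly to the previous theorem. We may state the following result.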


^Notice that $\frac{-\gamma+1}{2} \in (0,1)$ since $\gamma \in (-1,0)$.

Theorem 7 For the class of power divergences defined through the class of Cressie-Read functions with $\gamma \in (-1,0)$, assume that:

1. the kernel estimator is strongly consistent, i.e. $\sup_x |K_w * P_n(x) - p_{\phi^T}(x)| \to 0$ in probability;

2. the function $x \mapsto \varphi^{\#}\left(\frac{p_\phi}{p_{\phi^T}}\right)(x)$ is $P_T$-integrable for any $\phi$ in $\Phi$;

3. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$A_n = \sup_\phi \int \frac{(K_w * P_n)^{\frac{-\gamma+1}{2}}(x) + p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)}{p_\phi^{-\gamma}(x)}\,dx$$

is upper bounded independently of $n$ is greater than $1 - \eta_n$ for $\eta_n \to 0$;

4. for any $\varepsilon > 0$, there exists $n_0$ such that $\forall n > n_0$, the probability that the quantity

$$B_n = \sup_\phi \frac{1}{n}\sum_{i=1}^n p_\phi^\gamma(y_i)$$

is upper bounded independently of $n$ is greater than $1 - \eta_n$ for $\eta_n \to 0$;

5. $\inf_{\phi:\, \|\phi - \phi^T\| > \varepsilon} P_T h(P_T,\phi) > P_T h(P_T,\phi^T)$;

then the minimum dual $\varphi$-divergence estimator defined by (8) is consistent whenever it exists.

This last result is the least complicated among the previous ones since, on the one hand, there are no unknown functions such as $\lambda_1$ or $\lambda_2$. On the other hand, the sum in $B_n$ is over i.i.d. terms. According to the model and to the value of $\gamma$ in the interval $(0,1)$, one may use the results of either Theorem 3 or Theorem 4. The two general results are clearly restrictive, and one should derive a set of conditions suited to one's own particular case study. Those results serve as a guide to proving further ones.

The remainder of this section is devoted to showing how, in a Gaussian model, consistency of the kernel-based MD$\varphi$DE can be proved.

Example 8 We take a simple and ordinary example of a Gaussian model with unknown mean $\mu$ which is supposed to be in a closed interval $[\mu_{\min}, \mu_{\max}]$. We consider power divergences for which $\gamma \in (-1,0)$. The Gaussian kernel is used. Assumption 1 is easily checked by considering the list of conditions in Theorem A in Silverman [1978]. Assumption 2 is also very simple to check.

We use Theorem 5 to prove consistency. We calculate constants $A_n$ and $B_n$. By the strong consistency of the kernel estimator, it suffices for $A_n$ to study boundedness of the term which contains $p_{\phi^T}$:

$$\int \frac{p_{\phi^T}^{\frac{-\gamma+1}{2}}(x)}{p_\phi^{-\gamma}(x)}\,dx = c_1(\gamma)\exp\left(\frac{\gamma(\gamma-1)}{2(1+\gamma)}\mu^2\right)$$

for a constant $c_1$. This quantity is bounded since $\mu$ is supposed to be in a closed interval. Hence $A_n$ is bounded and assumption 3 is now verified.

On the other hand, in order to study $B_n$, it suffices to consider the quantity $\sup_\phi \int p_\phi^\gamma\, p_{\phi^T}$ by virtue of the Glivenko-Cantelli theorem. We have:

$$\int p_\phi^\gamma\, p_{\phi^T} = c_2(\gamma)\exp\left(-\frac{\gamma}{2(1+\gamma)}\mu^2\right)$$

for a constant $c_2$. Here again, since $\mu$ is supposed to be in a closed interval, the previous quantity is bounded. This entails that $B_n$ is bounded and assumption 4 is fulfilled.

We move now to the last assumption. By the dual representation of the divergence, we have $P_T h(P_T,\phi) = D_\varphi(p_\phi, p_{\phi^T})$. For the unit-variance Gaussian model with $\phi^T = 0$, this implies that:

$$P_T h(P_T,\mu) = \frac{1}{\gamma(\gamma-1)}\left(e^{\frac{\gamma(\gamma-1)}{2}\mu^2} - 1\right).$$

This function clearly verifies assumption 5 since it has a minimum at $\mu = 0$ and this minimum is well separated.
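As a sanity check on the kind of Gaussian integral used for $B_n$ in this example, the following sketch compares a numerical quadrature of $\int p_\mu^\gamma\, p_0$ with the closed form obtained by completing the square, namely $(2\pi)^{-\gamma/2}(1+\gamma)^{-1/2}\exp\left(-\frac{\gamma}{2(1+\gamma)}\mu^2\right)$. It assumes unit variance and a true mean equal to zero; the function names and the integration settings are hypothetical choices. The integrand is evaluated in log space because $p_\mu^\gamma$ with $\gamma < 0$ would otherwise underflow in the tails.

```python
import math


def log_normal_pdf(x, mean):
    """Log-density of N(mean, 1)."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def integral_numeric(mu, gamma, half_width=40.0, n=16000):
    """Trapezoid approximation of the integral of p_mu^gamma * p_0."""
    step = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(gamma * log_normal_pdf(x, mu)
                                   + log_normal_pdf(x, 0.0))
    return step * total


def integral_closed(mu, gamma):
    """Closed form of the same integral, valid for gamma > -1."""
    return ((2.0 * math.pi) ** (-gamma / 2.0) / math.sqrt(1.0 + gamma)
            * math.exp(-gamma * mu * mu / (2.0 * (1.0 + gamma))))


# Relative gaps for a few values of mu with gamma = -0.5.
gaps = [abs(integral_numeric(m, -0.5) / integral_closed(m, -0.5) - 1.0)
        for m in (-1.0, 0.0, 2.0)]
```

Because the exponent grows like $\mu^2$ when $\gamma \in (-1,0)$, the quantity is bounded only if $\mu$ ranges over a bounded set, which is why the example restricts $\mu$ to a closed interval.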


^The Glivenko-Cantelli theorem states that both quantities $\sup_\phi \frac{1}{n}\sum_{i=1}^n p_\phi^\gamma(y_i)$ and $\sup_\phi \int p_\phi^\gamma\, p_{\phi^T}$ are uniformly close for sufficiently large $n$ independently of $\phi$, hence boundedness of either of them implies boundedness of the other.

Remark 9 The argument concerning the boundedness of the term $\int \frac{(K_w * P_n)^{\frac{-\gamma+1}{2}}}{p_\phi^{-\gamma}}$ given above is not precise. We give a more accurate one for the interested reader. By Jensen’s inequality, one may write:




iv) 


dy = 


-27 
1 — 7 


{y)dy 


< 


Kw * Pn 


P^ 


-27 
1 — 7 


{y)dy 


n 


< — V, 

\ nw 
\ 1=1 


exp 


1 ( 1 


27 

2 \w'^ ' 1 — 7 





We calculate each integral separately: 


exp 


1 1 , 27 \ 2 , / yi , 27/1 

I , ' y ' ' 


dy = wcsi^, w) exp 


2 \w‘^ 1 — 7 / 1~7 

where C3{'j,w) = . We now proceed to estimate the sum over i: 


[y^ + ^y] 

2^2 (1 + ^) 


1 _ f _ 1 1 27£(_\ 1 V_ yi 

— \ e ^ / exp dy = — > e ^03(7 ,il?) exp 

nw J n 

i—l i—1 


[y^ + ^p) 

2w^ ( 


2 1 


1 + 


2710 ^ 

1-7 


2m^7^Li^ - 7 2 I 27U 

-^ ^ V • H- Vi 

1-7^1 ' l-7+27m2i'^ 


— 03(7, > e 

n 

2=1 


_L -* ^max ^ 7 ..2 i 27^t;;jin;i 

< —03(7, --i' + 27 u>^) \ g 1 -^+27 u>2 ^7 

r> 


2=1 


The final step is to use a version of the law of large numbers for independent random variables such as the two series theorem of Kolmogorov (see Feller [1971] Chap. VIII Theorem 3), since the terms of the sum do not have the same probability law, although they are all driven by the standard Gaussian law. The general term of the mean and its first two moments are:


Zr 

E[Z,] 

E[Zf] 


7 2 I 27/imin 

= exp - yi+z -—5—^ 

1 — 7 1 — 7 + Z7?ii^ 

1 - 7 7 ymin 


yi 


1-7 

1 + 7 

1-7 

1+37 


exp 


exp 


1 + 7 1 — 7 + 2 ^w^ 
1 - 7 IjPmiL 


1 + 37 1 — 7 + 2 jw‘^ 


applies and the average ^ e ^ 


The variance exists only when 7 G (—3,0). It results that for this range of values, the Kolomogrov two series theorem 

I y . 

-T » i-t+2t'™^ ‘ now converges in probability independently of /i. Besides, the 
remaining factorc3{‘^,w)eN-N)(WNWWU also converges as n goes to infinity (and w goes to zero) to a constant (equal 

1 — 7 

to 1). Thus, boundedness of J lAul ^y ig ensured. 

P (f) yV) 


Example 10 Let us take again the example of a Gaussian model with an unknown mean parameter $\mu$. Consider the class of power divergences with $\gamma \in (0,1)$. We verify the assumptions of the theorem of Section 2.4. We suppose also that the true distribution is the standard Gaussian law $\mathcal{N}(0,1)$. Let us consider a Gaussian kernel. We have:

$$K_w * P_T(x) = \frac{1}{\sqrt{2\pi(1+w^2)}}\, e^{-\frac{x^2}{2(1+w^2)}}.$$


In this example, it suffices to study consistency of the kernel-based MD$\varphi$DE for a fixed window. Indeed, the minimum of $P_T h(P_T,\mu)$ coincides with the minimum of the function $P_T H(P_T,\mu)$. This entails that if, for a fixed window, the kernel-based MD$\varphi$DE is consistent with respect to the minimum of $P_T H(P_T,\mu)$, then it is also consistent with respect to the minimum of $P_T h(P_T,\mu)$, which is the true parameter $\mu^T$. The corresponding list of conditions can easily be derived from Theorem 4. Indeed, consistency of the kernel is no longer needed. Points 2 and 3 are kept as they are. We replace $p_{\phi^T}$ in assumption 4 by the smoothed distribution $K_w * P_{\phi^T}$. Point 5 is stated with respect to $P_T H(P_T,\phi)$ instead of $P_T h(P_T,\phi)$. The arguments of the proof are the same. We only need to use the Glivenko-Cantelli theorem instead of the strong consistency of the kernel. Notice that point 4 is very hard, so that we only verify it when $w^2 > \frac{1}{2}$.

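We first check our claim that both $P_T H(P_T,\mu)$ and $P_T h(P_T,\mu)$ have the same minimum. The minimum of $P_T h(P_T,\mu)$ is attained when $\mu = 0$. We calculate an exact form of the function $P_T H(P_T,\mu)$. We have:

$$\int \frac{p_\mu^\gamma(x)}{(K_w * P_T)^{\gamma-1}(x)}\,dx = \frac{(1+w^2)^{\gamma/2}}{\sqrt{1+\gamma w^2}}\; e^{-\frac{\gamma(1-\gamma)}{2(1+\gamma w^2)}\mu^2},$$

$$\int \frac{p_\mu^\gamma(x)}{(K_w * P_T)^{\gamma}(x)}\,p_T(x)\,dx = \frac{(1+w^2)^{(\gamma+1)/2}}{\sqrt{(\gamma+1)w^2+1}}\; e^{-\frac{\gamma(1-\gamma+w^2)}{2(1+(\gamma+1)w^2)}\mu^2},$$

and thus:

$$P_T H(P_T,\mu) = \frac{1}{\gamma-1}\frac{(1+w^2)^{\gamma/2}}{\sqrt{1+\gamma w^2}}\; e^{-\frac{\gamma(1-\gamma)}{2(1+\gamma w^2)}\mu^2} - \frac{1}{\gamma}\frac{(1+w^2)^{(\gamma+1)/2}}{\sqrt{(\gamma+1)w^2+1}}\; e^{-\frac{\gamma(1-\gamma+w^2)}{2(1+(\gamma+1)w^2)}\mu^2} - \frac{1}{\gamma(\gamma-1)}.$$

Figure 2 shows the curve of this function for several values of $w$ and $\gamma$. It is clear that the infimum is unique and nicely separated. It is easy to see that the derivative of $P_T H(P_T,\mu)$ with respect to $\mu$ has a unique zero at $\mu = 0$. Besides, the function $\mu \mapsto P_T H(P_T,\mu)$ is strictly decreasing on $(-\infty,0)$ and strictly increasing on $(0,\infty)$. This is sufficient to prove our claim. Besides, assumption 5 is thereby verified.

Figure 2: Function $P_T H(P_T,\mu)$ for different windows and divergences. They all have an infimum at zero.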
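The closed form of $P_T H(P_T,\mu)$ for the standard Gaussian model smoothed by a Gaussian kernel can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names and the trapezoid quadrature are hypothetical choices, and the closed form coded here is the one obtained by completing the square in the two Gaussian integrals, i.e. $\frac{1}{\gamma-1}\frac{(1+w^2)^{\gamma/2}}{\sqrt{1+\gamma w^2}}e^{-\frac{\gamma(1-\gamma)}{2(1+\gamma w^2)}\mu^2} - \frac{1}{\gamma}\frac{(1+w^2)^{(\gamma+1)/2}}{\sqrt{1+(\gamma+1)w^2}}e^{-\frac{\gamma(1-\gamma+w^2)}{2(1+(\gamma+1)w^2)}\mu^2} - \frac{1}{\gamma(\gamma-1)}$.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def closed_form(mu, gamma, w):
    """Closed form of P_T H(P_T, mu) for P_T = N(0, 1) and a Gaussian kernel."""
    s = 1.0 + w * w
    d1 = 1.0 + gamma * w * w
    d2 = 1.0 + (gamma + 1.0) * w * w
    t1 = (s ** (gamma / 2.0) / math.sqrt(d1)
          * math.exp(-gamma * (1.0 - gamma) * mu * mu / (2.0 * d1)))
    t2 = (s ** ((gamma + 1.0) / 2.0) / math.sqrt(d2)
          * math.exp(-gamma * (1.0 - gamma + w * w) * mu * mu / (2.0 * d2)))
    return t1 / (gamma - 1.0) - t2 / gamma - 1.0 / (gamma * (gamma - 1.0))


def by_quadrature(mu, gamma, w, half_width=12.0, n=4800):
    """Same quantity, with both integrals approximated by the trapezoid rule."""
    s = 1.0 + w * w
    step = 2.0 * half_width / n
    total1 = total2 = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0
        p_mu = normal_pdf(x, mu, 1.0)      # model density
        smooth = normal_pdf(x, 0.0, s)     # K_w * P_T, a N(0, 1 + w^2) density
        total1 += weight * p_mu ** gamma * smooth ** (1.0 - gamma)
        total2 += weight * (p_mu / smooth) ** gamma * normal_pdf(x, 0.0, 1.0)
    return (step * total1 / (gamma - 1.0) - step * total2 / gamma
            - 1.0 / (gamma * (gamma - 1.0)))


gaps = [abs(closed_form(m, 0.5, 0.5) - by_quadrature(m, 0.5, 0.5))
        for m in (-1.0, 0.0, 0.7)]
```

For $\gamma \in (0,1)$ both exponential terms enter with a negative coefficient, so the function increases with $\mu^2$ and its minimum sits at $\mu = 0$ whatever the window.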

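We now move to verify the $P_T$-integrability of $x \mapsto \varphi^{\#}\left(\frac{p_\phi}{p_{\phi^T}}\right)(x)$. We have:

$$\left(\frac{p_\mu}{K_w * P_T}\right)^\gamma(x)\, p_T(x) = \frac{(1+w^2)^{\gamma/2}}{\sqrt{2\pi}}\exp\left(-\frac{(\gamma+1)w^2+1}{2(1+w^2)}x^2 + \gamma x\mu - \frac{\gamma\mu^2}{2}\right),$$

which is clearly integrable.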

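Assumption 3 demands that the quantity $\sup_\phi \int p_\phi^\gamma(x)\,dx$ is finite. We have:

$$\int p_\mu^\gamma(x)\,dx = \frac{(2\pi)^{(1-\gamma)/2}}{\sqrt{\gamma}}, \qquad \forall \mu \in \mathbb{R}.$$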


the Basu-Lindsay approach, this can happen if one can find a transparent kernel.
Hence the supremum over $\mu$ is finite, since the integral does not depend on $\mu$, and assumption 3 is verified.

Assumption 4 demands that the quantity sup^ „ ^ ( k *p^xp t ) have: 


Kw * PniVi) = — K 

n't!} ^ ^ 


i=i 


y^ - Vi 


K{0) , n-1 

- 1 ^ 


nw n — 1 




Vr - Vj 


The problem in the previous average is that the random variables inside are identically distributed but dependent, so that we cannot directly apply the law of large numbers. Let us calculate an almost sure bound manually. We have:


1 


n — 1 




Vi - Vj 
w 


1 


= e 


>6 2*1 


> e 2,1 


n — 




—^+- 
e 2^^ 


—vi- 

n — 1 ^ 


2u>2 


1 - 


2i(;2 n — 1 


E 


V%V3 


y] 


n — 1 


E 


Vj 


Now, the averages :;^^'^j^iy‘j and are sums of i.id. random variables. They are distributed inde¬ 
pendently of i by {n — 1) + 1 and Af {q, respectively. Moreover, the distribution of yi is independent of 

i. Hence, the distribution of the random variable = 1 — ^ independent of i. 

On the other hand, this random variable converges in probability (using the law of large numbers and the Slutsky’s 
lemma) to 1— which is strictly positive since > ^. Thus one can deduce the existence of uq independent of i 
such that, for n > uq the probability of the event > 1 — — c} is greater than 1 — r]„ for rj^ —>■ 0. The value 

of c is chosen such that c < 1 — This entails that: 

n 

-T. 


p<t>{yi) 

7 1 " 

< V 

P4>i.yy) 

.P<l>r{yz) X K^* Pn{yi)_ 

n ^ 

2 = 1 

.P^'r{y0e~^ {1 - 277 - c). 


It now suffices to prove the boundedness of the sum, which is a mean of i.i.d. random variables: the law of large numbers gives its limit, and the Glivenko-Cantelli theorem gives the result about the supremum over $\phi$. The limit in probability is given by:


P<piyi) 


.P,p'r{yi)e 2„ 


dPr = / exp 


(7 + I)?/;"* + (1 - 7)w;2 - 7 2 , 1 2 

2^2(1+ u;2) y+^yy 2 ^ 


It is clear that $(\gamma+1)w^4 + (1-\gamma)w^2 - \gamma$ needs to be positive in order for the theory to be applicable. A simple calculation shows that $w$ needs to verify the following condition:

$$w^2 > \frac{\gamma - 1 + \sqrt{5\gamma^2 + 2\gamma + 1}}{2(\gamma+1)}. \qquad (15)$$

Under this condition on $w$, the previous integral can be calculated and is given by:


P4>iyt) 


1 'y 


.P<t>r {yi)t 


dPx = exp 
yja 


1_ 

2a 


7/i 


where $a = \frac{(\gamma+1)w^4 + (1-\gamma)w^2 - \gamma}{w^2(1+w^2)}$. In order for the supremum over $\mu$ to exist, we need that $a > \gamma$, i.e. $(\gamma+1)w^4 + (1-\gamma)w^2 - \gamma > \gamma w^2(1+w^2)$. This happens if $w$ verifies:

$$w^2 > \frac{2\gamma - 1 + \sqrt{4\gamma^2+1}}{2}. \qquad (16)$$

Condition (16) implies (15), and hence is the relevant one. For example, for $\gamma = 0.1$, the corresponding condition on the window is $w > 0.33$. For a 100-sample from the standard Gaussian distribution, the window corresponding to Silverman’s rule of thumb is on average 0.35 whereas the Sheather and Jones window is on average 0.39. When adding 10% outliers from a Gaussian distribution $\mathcal{N}(10,1)$, these values become 0.41 and 0.427 on average. It is important to notice that the preceding analysis is very simplistic and was based on the naive and relatively harsh inequality $e^x \ge x + 1$. Thus, one should be able, using a more rigorous analysis, to get better lower bounds on the bandwidth of the window.
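The two bandwidth conditions are easy to evaluate. The short sketch below solves the quadratic inequalities in $w^2$ stated in the text, namely $(\gamma+1)w^4 + (1-\gamma)w^2 - \gamma > 0$ for (15) and $w^4 + (1-2\gamma)w^2 - \gamma > 0$ for (16); the function names `w_min_15` and `w_min_16` are hypothetical.

```python
import math


def w_min_15(gamma):
    """Smallest window allowed by condition (15)."""
    root = math.sqrt(5.0 * gamma ** 2 + 2.0 * gamma + 1.0)
    return math.sqrt((gamma - 1.0 + root) / (2.0 * (gamma + 1.0)))


def w_min_16(gamma):
    """Smallest window allowed by condition (16)."""
    root = math.sqrt(4.0 * gamma ** 2 + 1.0)
    return math.sqrt((2.0 * gamma - 1.0 + root) / 2.0)


# Threshold quoted in the text for gamma = 0.1 (about 0.33).
threshold = w_min_16(0.1)

# Condition (16) is the more demanding one over a grid of gamma in (0, 1).
dominates = all(w_min_16(g / 100.0) >= w_min_15(g / 100.0) for g in range(1, 100))
```

Since (16) requires $a > \gamma > 0$ while (15) only requires $a > 0$, the second threshold is always at least as large as the first.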

Finally, it is important to notice that the existence of the kernel-based MD$\varphi$DE $\hat\mu_n$ is guaranteed since the function $\mu \mapsto P_n H(P_n,\mu)$ has the form of the function $e^{-\mu^2}$. Besides, it has a limit equal to 0 as $\mu$ tends to $\pm\infty$. Moreover, it is continuous on $\mathbb{R}$ as a function of $\mu$, hence it is bounded and the infimum exists.


2.6 Influence Function for a given window

In practice, the choice of the window is based on methods such as cross-validation, Gaussian approximations or even personal experience. Thus, it is interesting to study the robustness properties supposing that the window is generated by an external tool.

We will use the influence function (IF) approach which, although limited to the existence of a noise component, is easy to calculate in general and gives an indication of the robustness of an estimator whenever the IF is bounded. We derive in this paragraph the influence function of the new MD$\varphi$DE for the class of power divergences. The general case of the function $\varphi$ seems to give an incomprehensible formula, and is not as interesting as the case of power divergences. Recall that the latter contains many classical divergences such as the Hellinger, Pearson’s $\chi^2$ and Neyman’s $\chi^2$ divergences.

Let $C$ be a functional which gives for a probability distribution $P$ the estimator corresponding to the argument of the infimum of $PH(P,\phi)$ defined earlier, i.e.

$$C(P) = \arg\inf_{\phi\in\Phi} PH(P,\phi).$$

Hence, $C(P_n)$ is none other than the estimator given by (8) for a given $w$. Fisher consistency is translated by $C(P_{\phi^T}) = \phi^T$. This is unfortunately not verified in general when the window is supposed to be calculated by an external tool, because the dual formula is a priori a lower bound of $D_\varphi(P_\phi, P_{\phi^T})$, and we cannot be sure that it would verify the same identifiability property, i.e. $D(Q,P) = 0$ iff $P = Q$ whenever $\varphi$ is strictly convex. Example 1 shows, however, a case where Fisher consistency is attained for any value of the window $w$.


The influence function measures the impact of a small perturbation in the distribution $P$ on the resulting estimator. It is hence defined by:

$$\text{IF}(P,Q) = \lim_{\varepsilon\to 0} \frac{C\left((1-\varepsilon)P + \varepsilon Q\right) - C(P)}{\varepsilon}.$$

We generally detect the influence of an outlier $x_0$ by observing what happens when we replace $P$ by $(1-\varepsilon)P + \varepsilon\delta_{x_0}$. In the literature of M-estimates, one may derive the IF from the estimating equation. For power divergences, the estimating equation corresponding to $P$ is given by:
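$$\frac{\gamma}{\gamma-1}\int \frac{p_{C(P)}^{\gamma-1}\,\nabla p_{C(P)}}{(K_w * P)^{\gamma-1}}(x)\,dx = \int \frac{p_{C(P)}^{\gamma-1}\,\nabla p_{C(P)}}{(K_w * P)^{\gamma}}(x)\,dP(x). \qquad (17)$$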

The influence function is obtained by differentiating the two sides with respect to $\varepsilon$ after having replaced $P$ by $(1-\varepsilon)P + \varepsilon Q$. The following result gives the formula of the IF for power divergences when the noise is generated by an arbitrary distribution $Q$ or when an outlier is present.

Theorem 11 The influence function of the kernel-based MD$\varphi$DE defined by (8) for a given window is given by:




A 


-1 


(K^ * PtP 


-(x)dQ(x). (18) 


If $C$ is Fisher consistent, i.e. $C(P_T) = \phi^T$, then the influence function is given by:

' plZ[Kw*Q]"^P4>r / P \ / N . A-1 f pIZ'^P4,t 


IF{Pt.Q) 


7% 


-1 


(iL„ * P)' 




Finally, if $Q = \delta_{x_0}$, then the IF is given by:


iHPr.x,) = 'Xa-' U_ p 

^ ^ w J (iF„*PT)T' V K,,*Pt)^ ’ 




^^This is regardless of the theoretical justifications of its existence.
^^The arginf function is a troublesome function.
Proof. By differentiating the left hand side of (17), we get:


7 

7-1 


(7 - l)Vpc'(P) i'^PciP))* +Pc{p)J] 


Po(r) 


7-2 

PC(P)^ 


* P)T'-l 




The right hand side gives: 


(7- l)Vpc(p) (Vpc(p))*Pc(P) +Pc(p) '^PC{P) 


[Kn. * {Q - P)] ^PC(P) 


{K^ * py 


Let A be the matrix given by: 


(cc)(iP(a;)IF(P,Q)-7 


{K^ * P)7+i 
' Pc~{P)^Pc{P) 


x)dP{x) 


{K^ * py 


{x){dQ — dP){x). 


A = 


7 _ P(x) 

7 -1 K^* P 


(7 - l)VPC(P) (VPC(P)) +7'C(P)'^pc( 


p) 


7-2 

PciP) 


{K^ * P) 


7-1 


We have now: 


AIF(P,Q)=7 


pI\p) [K^ * (Q - P)] Vpc(P) 
{K^ * PV 


pI(1)^Pc(p) 
{K^ * P)y 


x)dx - 


{x){dQ — dP){x) 


■ 7 


Pc(p) *{Q- P)] '^PCiP) 


{x)dP{x) 


which, assuming A is invertible and using the estimating equation (17), may be rewritten as: 

.7-1 -p^7j.)Vpc(P) 




■A- 


(Hu, * PP 


(x)dQ(x). 


The remainder of the proof is a simple substitution of $C(P)$ by $\phi^T$ when $P = P_{\phi^T}$, and replacing $Q$ by the Dirac measure at a point $x_0$. ■

Remark 12 The form of the IF is somewhat similar to the IF of the classical MD$\varphi$DE defined by (6). Toma and Broniatowski [2011] show that the IF of the classical MD$\varphi$DE is given by:


IE{Pt,x) = J 


-1 


P<j>T 


where J is the information matrix given by f 


^P^TlVp^Tf 


Pd,T 


. Going back to the IE of the new MDpDE given by (20), 


replacing * Pp by cancels the first term whereas the second term gives A ^ f 


-, where A = J -\—^ . 

p T ’ 7-1 


Intuitively, our modification has resulted in the term $\left(\frac{p_{\phi^T}}{K_w * P_T}\right)^\gamma$ which could force the IF to be bounded in some cases. This is the ratio between the true density and the smoothed one. When $\gamma > 0$, it is surprising that the IF becomes more bounded as the ratio between the true distribution and the smoothed one decreases, which means that the smoothing is producing overestimation at the tail of the distribution.

Example 13 We resume the univariate Gaussian example. Let us calculate the IF given by (20) since, as already seen in Example 2, the new MD$\varphi$DE is Fisher consistent.

The quantity $\frac{p_{\phi^T}^{\gamma-1}\nabla p_{\phi^T}}{(K_w * P_T)^\gamma}(x_0)$ is the only term which varies. It is given by:

$$\frac{p_{\phi^T}^{\gamma-1}\nabla p_{\phi^T}}{(K_w * P_T)^\gamma}(x_0) = (1+w^2)^{\gamma/2}\, x_0\, e^{-\frac{\gamma x_0^2}{2}}\, e^{\frac{\gamma x_0^2}{2(1+w^2)}} = (1+w^2)^{\gamma/2}\, x_0\, e^{-\frac{\gamma w^2 x_0^2}{2(1+w^2)}}.$$

Hence, this quantity is bounded as soon as $\gamma > 0$. The second quantity is an integral which needs to exist and be finite. We have:

- Xo)/w)Yp^T 

{K^ * 


1 - 


P 


K,i, * P 


(a:) = 




e 2(l+u;2) _ g 2 


w 




■ exp 


+ 1 

2w'^{l + w'^)‘ 


vT+1 

2 XXq 


e 2(1+™^) 


a;, 


2 1 
0 


W 


X 


1 


Vi 


■ 2^2 

_g 2(1+™2) _ g 2 




/i is clear now that if "/ > 0, the integral exists. We should not forget that the integral term also depends on xq. The 
dominating term is e~^°, so that the integral term is hounded as a function of xq as soon as the integral exists. 

It remains to show that the term A exists and is invertible. Since Vpc/)T = xe~^ and Jp .j, = (1 + x‘^)e~^ then: 


(7 - l)Vpc(P) (Vpc(P)) + Pc{P)Jpc( 


p) 


7-2 

Pc(p) 


{K^ * P)^-i 


1 + ^2 
27r 


(1 + 7X^)e . 


Hence, 


A 


1 +1(;2 


27r 


1 +1(;2 


27r 


7 

7- 

7 

7- 


1 


1 


(l + 7a;"^)e ^(1+™^) dec— 


1 + 


(1 + jx^)e 2{i+™^) da: 



where a = and b = ■ It is clear that for 7 G ( 0 , 1 ), i/ie two terms constituting A have the same sign, 

hence A cannot be zero since it is the sum of two negative terms. However, 1/7 > 1 , A may by zero for some cases. 
Indeed, A is 0 whenever 7^(1 + 7 + 27w^)^(l + (7 + — (7 — 1)(1 + 'u:^)(l + 7 + (7 + 2 )w^y = 0 . Notice that 

function w 7^(1 + 7 + 2 jw‘^)‘^{l + (7 + — (7 — 1)(1 + + 7 + (7 + 2 )w‘^)‘^ is equal to 2 j — I > 0 when 

w = 0, whereas it has a —00 limit at +00. Thus, it passes by zero since it is a continuous function. 

The previous arguments permit us to conclude that for $\gamma \in (0,1)$, the influence function of the estimator defined by (8) is bounded in the Gaussian model independently of the bandwidth of the Gaussian kernel. Moreover, it is unbounded for $\gamma < 0$. Hence, one can hope to get a robust estimation when $\gamma \in (0,1)$. However, further investigations are needed for the case of $\gamma < 0$.


3 The Basu-Lindsay approach


The idea of smoothing the empirical distribution was at first employed to avoid the problem of absolute continuity of the model with respect to $dP_n$ when we use the latter to replace the true distribution in (1), see Beran [1977] for the case of the Hellinger distance. Basu and Lindsay [1994] argue that the use of such methods requires consistency and rates of convergence for the kernel estimator. They propose to smooth not only the empirical distribution, but also the model. For example, if the smoothing is by convolution with a symmetric kernel $K$ such as the Gaussian kernel, the Basu-Lindsay approach is summarized in the following two lines:


$$p_\phi^w(x) = \frac{1}{w}\int p_\phi(y)\, K\left(\frac{x-y}{w}\right) dy,$$

$$\hat\phi_n = \arg\inf_{\phi\in\Phi} \int \varphi\left(\frac{p_\phi^w(x)}{K_{n,w}(x)}\right) K_{n,w}(x)\,dx, \qquad (21)$$

where $K_{n,w}(x) = \frac{1}{nw}\sum_{i=1}^n K\left(\frac{x-y_i}{w}\right)$ is the Parzen-Rosenblatt symmetric-kernel estimator. The authors prove the robustness of (21) using the residual adjustment function (RAF), see Lindsay [1994], since the corresponding influence function is generally unbounded, while keeping first order efficiency. There is still the choice of the kernel and its window, since their theoretical study demands a transparency assumption on the kernel which is not verified in general. A transparent kernel ensures no loss of information when smoothing the model density. They also show in simple examples that even when we use a non-transparent kernel, the loss of information is not large provided that we are using a convenient kernel.

For example, in the Gaussian model the Gaussian kernel verifies the transparency property. Besides, the smoothed model is merely a Gaussian density with variance equal to $\sigma^2 + h^2$. Thus, the Basu-Lindsay approach appears as if we are calculating a divergence between a weighted version of the model and the kernel estimator.
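The claim that smoothing a Gaussian model with a Gaussian kernel gives a Gaussian density with variance $\sigma^2 + h^2$ can be checked numerically. The sketch below is an illustration only: the function names, the trapezoid rule and the parameter values are hypothetical choices, and $h$ denotes the kernel window as in the sentence above.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def smoothed_model(x, mean, var, h, half_width=10.0, n=4000):
    """(1/h) * integral of p(y) K((x - y)/h) dy for a Gaussian kernel K."""
    lo = mean - half_width
    step = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        # normal_pdf(x, y, h*h) equals K((x - y)/h) / h for the Gaussian kernel.
        total += weight * normal_pdf(y, mean, var) * normal_pdf(x, y, h * h)
    return step * total


mean, var, h = 0.5, 1.5, 0.4
gaps = [abs(smoothed_model(x, mean, var, h) - normal_pdf(x, mean, var + h * h))
        for x in (-1.0, 0.3, 2.0)]
```

The smoothed model stays inside the Gaussian family, which is why smoothing with this kernel only rescales the model here rather than distorting it.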

^The transparency assumption here means that the smoothed score function (derivative of the log-likelihood) is proportional to the non-smoothed one. The proportionality factor can only be a function of the parameters.











































3.1 Smoothing-the-model’s effect 

The Basu-Lindsay approach seems to be more sensitive to the choice of the kernel than standard methods. For
example, take the case of densities defined on $(0,\infty)$ (with zero possibly included). Simple examples of such
distributions are Weibull distributions and generalized Pareto distributions (GPDs). It is well known that estimation
based on symmetric kernels is biased near zero. Thus, smoothing the model with such kernels results in a similar
bias near zero. Figure 3 shows the influence of a gaussian kernel on a GPD model. The smoothed model has a
peak near zero and then decreases towards zero, and hence largely underestimates the values of the "non smoothed"
model near zero. Thus, the divergence calculates a distance between a biased estimator of the true distribution and a
biased model, and there is no intuitive guarantee about what the minimization of such a function should give. Standard
methods which do not smooth the model suffer less from this sort of problem since the bias is only in the
kernel estimator.
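
The boundary effect described above can be illustrated with a short numerical sketch in Python. It is only an illustration under stated assumptions: the GPD parameters (shape 0.7, scale 3), the bandwidth h = 0.5 and all function names are chosen here for the example and are not the settings behind Figure 3. The model density is smoothed with a gaussian kernel and compared with the unsmoothed density at zero and at an interior point.

```python
import math


def gpd_pdf(y, shape, scale):
    """Generalized Pareto density on [0, inf); zero for negative y."""
    if y < 0.0:
        return 0.0
    return (1.0 / scale) * (1.0 + shape * y / scale) ** (-1.0 - 1.0 / shape)


def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def smoothed_gpd(x, shape, scale, h, steps=4000):
    """Trapezoid approximation of int_0^inf K_h(x - y) p(y) dy.
    The kernel is negligible beyond 10 bandwidths from x."""
    lo = max(0.0, x - 10.0 * h)
    hi = x + 10.0 * h
    dy = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * dy
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gaussian_kernel(x - y, h) * gpd_pdf(y, shape, scale)
    return total * dy


# Illustrative values (assumptions).
shape, scale, h = 0.7, 3.0, 0.5
ratio_at_zero = smoothed_gpd(0.0, shape, scale, h) / gpd_pdf(0.0, shape, scale)
ratio_inside = smoothed_gpd(5.0, shape, scale, h) / gpd_pdf(5.0, shape, scale)
print("smoothed / unsmoothed model at 0:", ratio_at_zero)
print("smoothed / unsmoothed model at 5:", ratio_inside)
```

At the boundary roughly half of the kernel mass falls outside the support, so the smoothed model sits well below the true density there, while the two nearly agree in the interior.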

Simulation results show that among the three methods which use a kernel estimator (Beran's approach, the Basu-
Lindsay approach and our kernel-based MD$\varphi$DE), the Basu-Lindsay approach is the most sensitive one. Under the
model, none of the three methods gives satisfactory results in comparison to the MLE (or the classical MD$\varphi$DE) when
symmetric kernels are used. When outliers are present, even the Basu-Lindsay estimator still gives a better result than
the MLE.



Figure 3: Smoothing the model with a gaussian kernel results in a great loss of information. The use of an asymmetric
kernel such as the reciprocal inverse gaussian (RIG) seems to be a good alternative.

The solution to the previous problem is of course either to use a bias-correction method, see Karunamuni and Alberts
[2005], or to use asymmetric kernels which do not suffer from the boundary bias, see Libengué Dobélé-Kpoka [2013].
A more intriguing example is a Weibull distribution with shape parameter in $(0,1)$. The density function explodes
to infinity as we approach zero. Cases such as GPD models can be treated efficiently using bias-correction
methods since these assume that the support is semi-closed. Models which have singularities such as the Weibull
model can be treated using asymmetric kernels such as gamma kernels or reciprocal inverse gaussian kernels.
These methods can be employed to recover a good performance in the Basu-Lindsay approach and give better results
for Beran's approach and our kernel-based MD$\varphi$DE.

Footnote: Of course, if we define the Weibull distribution with a location parameter, the pdf explodes to infinity near the value of the
location parameter.

Footnote: Asymmetric kernels have the attractive property that they can treat both bounded and unbounded densities.

Let us see how this kind of solution can be applied to the Basu-Lindsay approach. We discuss only the case of
asymmetric kernels since similar arguments apply for bias-correction methods. Let $\hat{f}$ be the asymmetric-kernel
estimator defined by:

\[
\hat{f}(x) = \frac{1}{n\, c(y_1,\dots,y_n)} \sum_{i=1}^{n} K_{x,w}(y_i),
\]

where $K_{x,w}$ is the asymmetric kernel calculated at observation $y_i$, and $c(y_1,\dots,y_n)$ is a constant which ensures
integrability to 1. For example, $K$ is the gamma kernel:

\[
K_{x,w}(y) = \frac{y^{x/w}}{\Gamma(1+x/w)\, w^{1+x/w}}\, e^{-y/w}, \qquad \text{for } y \in [0,\infty),
\]

where $\Gamma$ is the classical gamma function. The estimator $\hat{f}$ can no longer be defined as the convolution between the
asymmetric kernel and the empirical distribution in the same way as for symmetric kernels. Thus, the smoothed model in
the Basu-Lindsay approach can no longer be obtained by simple convolution. It is given by:

\[
p^{*}_{\phi}(x) = \int_{0}^{\infty} \frac{1}{c(y)}\, K_{x,w}(y)\, p_{\phi}(y)\, dy,
\]

where $c(y)$ is a function which normalizes the kernel for each value of $y$ so that it is a density. It is given by:

\[
c(y) = \int_{0}^{\infty} K_{x,w}(y)\, dx.
\]

Unfortunately, this normalization function can only be calculated numerically. Taking into account the number
of integrations needed to perform such a task, and the calculation of the divergence afterwards which also needs
numerical integration, we get a great complexity. In comparison to the classical approach of Beran [1977], the
calculation of the smoothed model imposes two extra embedded integrals, making the calculation of the $\varphi$-divergence very
difficult on two levels. The first one is the execution time, and the second one is the subtlety of the whole calculation
since all these integrals are carried out over slowly decreasing functions on the half line.
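
To make the role of the normalization function concrete, here is a minimal sketch in Python of its numerical evaluation. It is not the authors' implementation: it assumes the standard gamma kernel with shape $1 + x/w$ and scale $w$, and the function names, the truncation point of the integral and the window w = 0.5 are illustrative assumptions.

```python
import math


def gamma_kernel(x, y, w):
    """Gamma kernel K_{x,w}(y), assumed here to be the gamma density with
    shape 1 + x/w and scale w, evaluated at y (log scale for stability)."""
    if y <= 0.0:
        return 0.0
    t = x / w
    log_k = t * math.log(y) - y / w - math.lgamma(1.0 + t) - (1.0 + t) * math.log(w)
    return math.exp(log_k)


def normalizer(y, w, steps=20000):
    """Trapezoid approximation of c(y) = int_0^inf K_{x,w}(y) dx.
    The integral is truncated where the integrand is negligible."""
    upper = y + 12.0 * math.sqrt(w * y) + 30.0 * w
    dx = upper / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gamma_kernel(i * dx, y, w)
    return total * dx


w = 0.5  # illustrative window (assumption)
c_near_zero = normalizer(0.05, w)
c_far = normalizer(10.0, w)
print("c(0.05) =", c_near_zero)
print("c(10)   =", c_far)
```

Under these assumptions the normalizer is close to 1 away from the boundary and clearly below 1 near zero, which is where omitting it would cost the most. Each evaluation is itself an integral, so nesting it inside the smoothed model and then inside the divergence multiplies the cost accordingly.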


Remark 14 We were unable to use asymmetric kernels in the Basu-Lindsay approach, because the numerical integration
(three embedded integrals) failed even when restricting the calculation of the normalizing function $c(y)$ to a finite interval.
The execution time using the statistical tool R (R Core Team [2015]) on an il laptop with 8G RAM was 12 minutes for
a simple calculation of the smoothed model. One can imagine now the execution time of the $\varphi$-divergence and finally
of the optimization over $\phi$. The method should work if one can handle efficiently the problem of numerical integration,
and should then give results close to the case when we do not smooth the model.

Remark 15 The use of the normalization function is necessary to get a very small loss of information. If it is
not used, there will be an underestimation near zero similar to the case of symmetric kernels when applied to models
defined on a semi-closed interval.

Very recently, Mnatsakanov and Sarkisian [2012] have proposed a method which does not contain a normalization
function. Their approach is based on the so-called Mellin transform to approximate the distribution function and
then derive an estimate of the density function. Their estimator, called the varying kernel density estimator (vKDE),
is defined by:



This estimator is different from estimators based on symmetric or asymmetric kernels, as explained by the
authors. They provide a bias-corrected version of this estimator to reduce the bias at the boundary. Nevertheless,
we prefer to use (22) because it integrates to 1, and because the Basu-Lindsay approach can be performed more efficiently
and reasonably with it than with asymmetric kernels when working with distributions defined on the half
line. The parameter $a$ is a natural number, and (22) is $L^1$-consistent as $a$ goes to infinity under suitable conditions.
It even achieves the optimal rate of convergence for the MSE and the MISE.

It is important to notice that fa(0) = 1 for a > 1. Thus, it is preferable to be used for densities which have value 
equal to 0 at 0 or for densities which are defined on (0,oo). In kernel-based estimation procedures, the value at zero 
is not important because it disappears in integration calculus. Besides, no observation will have exactly the value 
zero. 

Footnote: The calculation of bounded integrals is far simpler than that of infinite integrals. Besides, a slowly decreasing function (at the border of
its domain), even if it is smooth, is harder to handle by numerical integration methods than a fast decreasing one.

4 Advantages and disadvantages of the new reformulation

The new reformulation of the minimum dual $\varphi$-divergence apparently has many advantages in comparison with the
classical approach and the Basu-Lindsay estimator. We list some of these points.

• The role of the kernel estimator appears directly in the formula of the IF, and the ratio between the true
distribution and its smoothed version is the part which controls the boundedness of the IF. The method also
inherits its robustness from the fact that it approximates a $\varphi$-divergence. Simulation results show that it still competes
with the performance of both the MLE and the classical MD$\varphi$DE when we are under the model.


• In comparison with the classical MD$\varphi$DE, our new approach omits the double optimization by approximating
the argument of the supremum in the dual representation. This constitutes a very important step
since, on the one hand, the double optimization requires a greater execution time, which is of the order of
the square of the time needed for a simple optimization. On the other hand, the supremal form of the
objective function to be minimized afterwards creates further complications in studying the regularity (continuity
and differentiability) of the supremal function, which plays an important role in optimization methods.
It is true that optimization methods for non-differentiable functions exist; nevertheless, these methods suffer
from low convergence rates in comparison to methods which use the gradient of the objective function,
such as first-order gradient descent (or the Hessian matrix, such as second-order gradient descent and BFGS).


• Our approach contains only one integration which should be calculated numerically, whereas the smoothing-of-
the-model technique in the Basu-Lindsay estimator creates another integration which should in general be calculated
numerically. Besides, this calculation takes place inside an external integration. Thus the
number of numerical integrations is highly increased depending on the difficulty of the external integration.
Besides, the use of asymmetric kernels or bias-correction methods is not possible since these methods add
another internal integral, unless one solves all these integrals efficiently; see Sec. 3 for more details.

• There is a difference between our new approach and direct smoothing techniques. Although the performances
and estimates are close, our approach keeps the philosophy of approximating a divergence between the model
and the empirical distribution. In the Basu-Lindsay approach, or in classical methods which insert a kernel in (1)
such as Beran [1977], the divergence is calculated between the empirical distribution and the (smoothed) model.


We list some of the drawbacks of our approach: 

• Our method still suffers, similarly to the Basu-Lindsay approach and any method which uses kernels, from
the problem of choosing the kernel and the window. This problem stays minor as long as we are working
with regular densities which converge to zero at both extremities of the support. When the density tends to
infinity on the border or does not converge to zero, asymmetric kernels or bias-correction methods are needed.
Unfortunately, these two tools, although efficient, lack a good and general method for the choice of the window.


• Unlike the Basu-Lindsay approach, we were not able to formulate a general condition such as kernel transparency
in order to avoid the need for consistency of the kernel estimator in some cases. Note, however, that
verifying this transparency condition is still a very hard task, and if it is not verified, consistency of the kernel estimator is needed.


• As we will see in the simulation paragraph, our kernel-based MD$\varphi$DE has apparently traded some of its efficiency
for robustness. It is therefore not as good as the MLE and the classical MD$\varphi$DE under the model.


• Our approach is not suitable for working with multidimensional distributions, since in higher dimensions the
so-called curse of dimensionality appears and the neighborhoods of observed data become void. Thus the
calculation of the kernel estimator would require much more data than in univariate problems. This is not the case
for the classical approach. We can still use projection-based nonparametric estimators to replace the kernel and
do the job. We also present hereafter a particular solution for contamination models which can be generalized
directly to multivariate cases.


Footnote: There is also the initialization problem for each internal optimization.

Footnote: The difficulty sometimes comes from a bad shape of the integrand or from irregularities. It also comes from functions with a low decreasing rate
at infinity for infinite integrals.

Footnote: For our kernel-based MD$\varphi$DE, the gaussian location model does not need consistency of the kernel estimator when the gaussian kernel is used.

5 The Dual divergence estimator

5.1 General facts and comments 


The dual divergence estimator (D$\varphi$DE) was defined in Broniatowski and Keziou [2009] as the argument of the
supremum in (5). It is defined by:

\[
\hat{\alpha}_n = \arg\sup_{\alpha\in\Phi} \left\{ \int \varphi'\!\left(\frac{p_\phi}{p_\alpha}\right)(x)\, p_\phi(x)\, dx - \frac{1}{n}\sum_{i=1}^{n} \varphi^{\#}\!\left(\frac{p_\phi}{p_\alpha}\right)(y_i) \right\}. \tag{23}
\]

This estimator is far simpler than the classical MD$\varphi$DE defined by (6) since it needs only a simple optimization
over $\alpha$ for a given choice of the escort parameter $\phi$. Besides, this estimator is proved to be robust in some models
from an IF point of view, provided a suitable choice of the escort parameter. Indeed, the IF is given by (see Toma
and Broniatowski [2011]):


IF(j/|</)) = 



Jf{x)p^T{x)dx 


{x)\7^p^T{x)dx - 


\P4>'rJ P,j>-r{y) \ 


where: 


r p^ 

f{a,4‘,y)= / -Apcpdx 

J Pa 


.Pa 



Previous papers which discussed the choice of the escort parameter have either left the choice arbitrary in the region
where the IF is bounded (this is the case of Toma and Broniatowski [2011]), or proposed to use robust estimates for
the escort parameter (this is the case of Cherfi [2011] and Frydlova et al. [2012]). The first idea is very complicated
since we have no idea about the true value of the parameters, and a bad choice of the escort parameter, even inside
the region where the IF is bounded, does not ensure a good result. In Frydlova et al. [2012] and Cherfi [2011],
experimental results show that the D$\varphi$DE in a normal model is very close to the escort parameter and coincides with
the escort parameter when the latter is equal to the MLE. The last fact can be easily verified following the proof of
Theorem 6 in Broniatowski [2014]. Indeed, one may show that the MLE is a zero of the estimating equation of the
D$\varphi$DE and that the Jacobian matrix of the corresponding objective function is negative definite there. On the other hand, the
use of a robust escort parameter is not always a good idea. We discuss these two ideas on two examples.


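Example 16 We resume the two-component gaussian mixture example. We have already shown in paragraph 1.2 that the classical
MD$\varphi$DE has an unbounded IF in this model. The IF of the D$\varphi$DE is not the same. We will try to
give some conditions on the escort parameter in order to make it bounded. The first term in the influence function is
a matrix which is independent of $y$ and is constant. Supposing that it is invertible, our job is to investigate both the
existence of the integral, which is also a constant, and the remaining term which changes according to $y$. The integral
exists since the growth of the fraction is dominated by the decay of the derivative. The remaining term needs to be
studied extensively. One of the fractions it contains was already studied in the case of the MD$\varphi$DE. We therefore only need
to study the fraction $p_\phi / p_{\phi^T}$:

\[
\begin{aligned}
\frac{p_\phi}{p_{\phi^T}}(y)
&= \frac{\lambda e^{-\frac{1}{2}(y-\mu_1)^2} + (1-\lambda) e^{-\frac{1}{2}(y-\mu_2)^2}}{\lambda^T e^{-\frac{1}{2}(y-\mu_1^T)^2} + (1-\lambda^T) e^{-\frac{1}{2}(y-\mu_2^T)^2}} \\
&= e^{y(\mu_1-\mu_1^T) + \frac{1}{2}(\mu_1^T)^2 - \frac{1}{2}\mu_1^2}\;
\frac{\lambda + (1-\lambda) e^{y(\mu_2-\mu_1) + \frac{1}{2}\mu_1^2 - \frac{1}{2}\mu_2^2}}{\lambda^T + (1-\lambda^T) e^{y(\mu_2^T-\mu_1^T) + \frac{1}{2}(\mu_1^T)^2 - \frac{1}{2}(\mu_2^T)^2}} \\
&= e^{y(\mu_2-\mu_2^T) + \frac{1}{2}(\mu_2^T)^2 - \frac{1}{2}\mu_2^2}\;
\frac{\lambda e^{y(\mu_1-\mu_2) + \frac{1}{2}\mu_2^2 - \frac{1}{2}\mu_1^2} + 1-\lambda}{\lambda^T e^{y(\mu_1^T-\mu_2^T) + \frac{1}{2}(\mu_2^T)^2 - \frac{1}{2}(\mu_1^T)^2} + 1-\lambda^T}.
\end{aligned}
\]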


When $y$ tends to $-\infty$, if $\mu_1 > \mu_1^T$, then the second line shows that the fraction has a finite limit equal to 0.
Otherwise, it tends to $+\infty$. When $y$ tends to $+\infty$, if $\mu_2 < \mu_2^T$, the third line shows that the fraction has a finite limit
equal to 0. Otherwise, it tends to $+\infty$. We need to combine this with the terms of the vector derivative. The derivative
with respect to $\lambda$ is already bounded, and hence no additional condition is needed. The derivative with respect to $\mu_1$
is also bounded at $+\infty$. However, at $-\infty$ it is of order $y$. Still, it vanishes against the exponential term in $y(\mu_1 - \mu_1^T)$
which comes from the fraction under the conditions $\gamma > 0$ and $\mu_1 > \mu_1^T$. Finally, the derivative with respect to
$\mu_2$ is treated similarly.

We conclude that, provided that the matrix term is invertible, the influence function of the D$\varphi$DE is bounded whenever
the escort parameter verifies either of the following conditions according to the value of $\gamma$:

\[
\mu_1 > \mu_1^T, \quad \mu_2 < \mu_2^T \quad \text{if } \gamma > 0 \tag{24}
\]

\[
\mu_1 < \mu_1^T, \quad \mu_2 > \mu_2^T \quad \text{if } \gamma < 0 \tag{25}
\]

The use of a robust escort parameter verifying the set of conditions (24, 25) leads to a more robust estimator than
the escort. However, the use of a robust escort parameter which does not fulfill the set of conditions (24, 25) has a
negative impact on the resulting estimator. In our simulations in Sec. 7, we have analyzed the mixture whose true
set of parameters is $(\lambda^T = 0.35, \mu_1^T = -2, \mu_2^T = 1.5)$. We used our new MD$\varphi$DE (with Silverman's rule for the
window) as an escort parameter. The divergence criterion is the Hellinger divergence which corresponds to $\gamma = 0.5$.
Thus, we are in the context of condition (24). The new MD$\varphi$DE verifies this condition and the resulting D$\varphi$DE has
on average a better error, see table 1. In the same table, we give another escort parameter which is as good as the
previous one according to our two error criteria, and even slightly better. If we calculate the D$\varphi$DE using this escort
parameter, which clearly does not verify condition (24), the resulting estimator does not give a better estimate than
the escort. It is clearly worse since the error has increased by roughly 50% (from 0.142 to 0.213 and from 0.079 to 0.115).


Estimator                                      χ²       Total variation
φ1 = (λ = 0.349, μ1 = -1.767, μ2 = 1.377)      0.155    0.087
φ2 = (λ = 0.36, μ1 = -2.2, μ2 = 1.7)           0.142    0.079
DφDE(φ1)                                       0.142    0.076
DφDE(φ2)                                       0.213    0.115

Table 1: The influence of a robust escort parameter on the D$\varphi$DE in a mixture of two gaussian components. The
error is calculated between the true distribution and the estimated one, see Sec. 7.
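
The position of the two escort parameters of table 1 with respect to conditions (24) and (25) can be checked mechanically. The short Python sketch below only encodes the two inequalities; the function name and the tuple layout (proportion, first mean, second mean) are illustrative assumptions, and the check says nothing about the invertibility of the matrix term.

```python
def escort_satisfies_condition(escort, truth, gamma):
    """Check conditions (24) (gamma > 0) or (25) (gamma < 0) for an escort
    parameter (lambda, mu1, mu2) against the true parameter."""
    _, mu1, mu2 = escort
    _, mu1_true, mu2_true = truth
    if gamma > 0:
        return mu1 > mu1_true and mu2 < mu2_true
    if gamma < 0:
        return mu1 < mu1_true and mu2 > mu2_true
    raise ValueError("gamma must be non-zero")


truth = (0.35, -2.0, 1.5)       # true parameters of the mixture
phi_1 = (0.349, -1.767, 1.377)  # first escort of table 1
phi_2 = (0.36, -2.2, 1.7)       # second escort of table 1
gamma = 0.5                     # Hellinger divergence

print("phi_1 verifies (24):", escort_satisfies_condition(phi_1, truth, gamma))
print("phi_2 verifies (24):", escort_satisfies_condition(phi_2, truth, gamma))
```

The first escort lies inside the region given by (24) and the second one outside it, in line with the errors reported in table 1.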


Example 17 Let $p_\phi$ be a generalized Pareto distribution:

\[
p_{\nu,\sigma}(y) = \frac{1}{\sigma}\left(1 + \nu\frac{y}{\sigma}\right)^{-1-\frac{1}{\nu}}, \qquad \text{for } y > 0.
\]

The shape and the scale are supposed to be unknown and equal to $\nu^T = 0.7$, $\sigma^T = 3$. For the IF of the
D$\varphi$DE to be bounded, it is necessary, depending on the value of $\gamma$, to locate the shape of the escort parameter with respect to the true
value of the shape parameter. If $\gamma \in (0,1)$, it is necessary for the IF to be bounded that $\nu < \nu^T$. If $\gamma < 0$, then the
IF can be bounded whenever $\nu > \nu^T$. Our simulation results in paragraph 7.3 show that for $\gamma = 0.5$ (the Hellinger
divergence), the D$\varphi$DE calculated using a robust escort parameter (our kernel-based MD$\varphi$DE) deteriorates the
performance significantly. The total variation distance corresponding to the escort parameter is 0.05 whereas the total
variation distance corresponding to the D$\varphi$DE is 0.12. The escort parameter gives an estimate of the shape parameter of
0.766, which seems to be a good estimate but lies above the true value 0.7. It is worth noting that the D$\varphi$DE still gives better results than those obtained using
the MLE, which gives a total variation distance equal to 0.195.


The past two examples contradict the conjecture of both Frydlova et al. [2012] and Cherfi
[2011] about the use of a robust escort parameter. The use of a robust escort is a gamble and does not guarantee a
better estimator than the escort itself. Thus, we are taking a great risk by using the D$\varphi$DE. Notice, finally, that the
D$\varphi$DE is still more robust than the MLE and the classical MD$\varphi$DE even if the IF is not bounded.

5.2 Relation with the density power divergences 

The minimum density power divergence estimator (MDPD) was first introduced by Basu et al. [1998]. It is defined by:

\[
\begin{aligned}
\hat{\phi}_n &= \arg\inf_{\phi\in\Phi} \int p_\phi^{1+a}(z)\, dz - \frac{a+1}{a}\,\frac{1}{n}\sum_{i=1}^{n} p_\phi^{a}(y_i) \\
&= \arg\inf_{\phi\in\Phi} \mathbb{E}_{P_\phi}\left[p_\phi^{a}\right] - \frac{a+1}{a}\,\mathbb{E}_{P_n}\left[p_\phi^{a}\right]
\end{aligned}
\tag{26}
\]

Footnote: The IF contains the inverse of a 2 x 2 matrix which cannot be simply calculated. Since it is a mere constant, we only discussed the
other terms in the IF.

Footnote: See the rest of the simulations for more examples.

Let us look at the D$\varphi$DE for power divergences with $\gamma = -a < 0$. It is given by:


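\[
\begin{aligned}
\hat{\alpha}_n &= \arg\sup_{\alpha\in\Phi} \frac{1}{\gamma-1}\int \left(\frac{p_\theta}{p_\alpha}\right)^{\gamma-1}\!(x)\, p_\theta(x)\, dx - \frac{1}{\gamma}\,\frac{1}{n}\sum_{i=1}^{n} \left(\frac{p_\theta}{p_\alpha}\right)^{\gamma}\!(y_i) \\
&= \arg\sup_{\alpha\in\Phi} -\frac{1}{a+1}\int \frac{p_\alpha^{a+1}}{p_\theta^{a}}(x)\, dx + \frac{1}{a}\,\frac{1}{n}\sum_{i=1}^{n} \left(\frac{p_\alpha}{p_\theta}\right)^{a}\!(y_i) \\
&= \arg\inf_{\alpha\in\Phi} \int \frac{p_\alpha^{1+a}}{p_\theta^{a}}(x)\, dx - \frac{1+a}{a}\,\frac{1}{n}\sum_{i=1}^{n} \left(\frac{p_\alpha}{p_\theta}\right)^{a}\!(y_i) \\
&= \arg\inf_{\alpha\in\Phi} \mathbb{E}_{P_\alpha}\left[\left(\frac{p_\alpha}{p_\theta}\right)^{a}\right] - \frac{a+1}{a}\,\mathbb{E}_{P_n}\left[\left(\frac{p_\alpha}{p_\theta}\right)^{a}\right]
\end{aligned}
\tag{27}
\]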


A simple comparison between (26) and (27) shows that the D$\varphi$DE seems to be a penalized form of the MDPD. This
penalization by a density $p_\theta$ creates a big problem from a robustness point of view. The robustness of the D$\varphi$DE
is now controlled not only by the divergence power $a = -\gamma$ but also through $p_\theta$. We have seen in the previous
paragraph that the robustness of the D$\varphi$DE in a two-component gaussian mixture varies according to the position of
$\theta$ relative to the true set of parameters. The difficulty of choosing this escort parameter constitutes the only drawback of the D$\varphi$DE
in comparison to the MPD. There is still a positive point in its favor. Indeed, the penalization by $p_\theta$ can be reread in
the spirit of Broniatowski and Keziou [2006]. The ratio $p_\alpha / p_\theta$ is the Radon-Nikodym density of $P_\alpha$ with respect to
$P_\theta$. Thus, one can define the D$\varphi$DE even if $P_\alpha$ is not absolutely continuous with respect to the Lebesgue measure
on $\mathbb{R}$. This cannot be done for the MDPD.
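
As an illustration of criterion (26), the following Python sketch evaluates the MDPD objective in the gaussian model and minimizes it on contaminated data. It is a sketch under stated assumptions and not the code used for the simulations: the integral of $p^{1+a}$ is replaced by its gaussian closed form $(2\pi\sigma^2)^{-a/2}/\sqrt{1+a}$, the minimization is a plain grid search rather than the Nelder-Mead algorithm, and the seed, grid bounds and function names are arbitrary choices.

```python
import math
import random


def mdpd_objective(mu, sigma, a, sample):
    """Criterion (26) for the gaussian model N(mu, sigma^2)."""
    norm = (2.0 * math.pi * sigma ** 2) ** (-a / 2.0)
    integral = norm / math.sqrt(1.0 + a)  # closed form of int p^(1+a)
    mean_power = sum(
        norm * math.exp(-a * (y - mu) ** 2 / (2.0 * sigma ** 2)) for y in sample
    ) / len(sample)
    return integral - (a + 1.0) / a * mean_power


def mdpd_grid_search(sample, a):
    """Minimize the criterion over a coarse (mu, sigma) grid (assumed bounds)."""
    best = None
    for i in range(61):          # mu in [-1.5, 1.5]
        mu = -1.5 + 0.05 * i
        for j in range(71):      # sigma in [0.5, 4.0]
            sigma = 0.5 + 0.05 * j
            value = mdpd_objective(mu, sigma, a, sample)
            if best is None or value < best[0]:
                best = (value, mu, sigma)
    return best[1], best[2]


random.seed(0)
sample = sorted(random.gauss(0.0, 1.0) for _ in range(100))
sample[-10:] = [10.0] * 10  # replace the 10 largest values by the value 10

mle_mu = sum(sample) / len(sample)
mu_hat, sigma_hat = mdpd_grid_search(sample, a=0.5)
print("MLE of the mean:", mle_mu)
print("MDPD (a = 0.5) estimates:", mu_hat, sigma_hat)
```

The exponential weights in the second term of (26) downweight the observations at 10, so the minimizer stays near the bulk of the data while the empirical mean is pulled towards the outliers.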


6 A solution for the case of contamination models 


We define a contamination model to be the following mixture model:

\[
P_T = (1-\varepsilon) P_{\phi^T} + \varepsilon Q
\]

for $\varepsilon \in [0,1)$ which should be small. We have already seen above that the main problem in the classical MD$\varphi$DE
is that the dual representation (4) largely underestimates the divergence between the true distribution and the model
when the data is contaminated. Since the supremum is attained when $p_\alpha = p_T$, the model $p_\alpha$ cannot cope with
the contamination part $\varepsilon Q$ while keeping a good distance from the main part of the distribution $(1-\varepsilon)P_{\phi^T}$. In order to
reestablish the supremum attainment, or at least reduce the gap between the dual representation and the true value
of the divergence $D_\varphi(P_\phi, P_T)$, we propose to replace $p_\alpha$ by a contaminated model $(1-\lambda)p_\alpha + \lambda q_\theta$. This corresponds
to the use of the following class of functions in the dual formula of the divergence (3):


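\[
\mathcal{F}_\phi = \left\{ x \mapsto \varphi'\!\left(\frac{p_\phi}{(1-\lambda)p_\alpha + \lambda q_\theta}\right)(x),\ \alpha\in\Phi,\ \theta\in\Theta,\ \lambda\in[0,1) \right\}.
\]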


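The minimum dual divergence estimator can now be defined by:

\[
\hat{\phi}_n = \arg\inf_{\phi\in\Phi}\ \sup_{\alpha\in\Phi,\,\theta\in\Theta,\,\lambda\in[0,1)} \left\{ \int \varphi'\!\left(\frac{p_\phi}{(1-\lambda)p_\alpha + \lambda q_\theta}\right)(x)\, p_\phi(x)\, dx - \frac{1}{n}\sum_{i=1}^{n} \varphi^{\#}\!\left(\frac{p_\phi}{(1-\lambda)p_\alpha + \lambda q_\theta}\right)(y_i) \right\}. \tag{28}
\]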


When we replace $P_n$ by $P_T$, the supremum is attained whenever $\alpha = \phi^T$, $\lambda = \varepsilon$, $q_\theta = dQ/dx$. Hence, if we are under
the model, i.e. $\varepsilon = 0$ and $dQ/dx$ is in the submodel $(q_\theta)_{\theta\in\Theta}$, the previous estimator is Fisher consistent unlike our
estimator defined by (8).

The estimator defined by (28) is clearly a modified version of the classical MD$\varphi$DE defined by (6), and shows what
the classical approach misses. The advantage of such an approach is that we can use it in multidimensional problems
without further modifications (unlike our first approach given in paragraph 1.3). On the other hand, the choice of a
model for the contamination part may be easier than the choice of the kernel and its window in the MD$\varphi$DE defined
by (8), since there is already a whole theory in the literature of time series for modeling the contamination (noise)
in a dataset.

The influence function of the estimator defined by (28) can be calculated similarly to that of the classical MD$\varphi$DE (see
Toma and Broniatowski [2011]). However, the general case when $dQ/dx$ is not a member of the submodel $(q_\theta)_{\theta\in\Theta}$ is
very complicated. There is still a simple case when the contamination $Q$ is a member of the submodel $(q_\theta)_{\theta\in\Theta}$. In
this case, the attainment of the supremum in the dual representation permits us to use directly $D_\varphi(P_\phi, P_T)$ which
is proved to be a robust tool, see Donoho and Liu [1988]. Thus, the influence function is unbounded and is the same
as the influence function of a $\varphi$-divergence $D_\varphi(P_\phi, P_T)$, which is the same as the IF of the MLE (and the classical
MD$\varphi$DE), see Lindsay [1994].

7 Simulation study

We summarize the results of 100 experiments by giving the average of the estimates and of the error committed, and
the corresponding standard deviations. We consider two error criteria: the total variation distance (TVD) and the
Chi square divergence between the true distribution and the estimated one. These criteria are defined as follows:

\[
\chi^2(p_\phi, p_{\phi^T}) = \int \frac{\left(p_\phi(y) - p_{\phi^T}(y)\right)^2}{p_{\phi^T}(y)}\, dy,
\]

\[
\mathrm{TVD}(p_\phi, p_{\phi^T}) = \sup_{A\in\mathcal{B}(\mathbb{R})} \left| dP_\phi(A) - dP_{\phi^T}(A) \right|.
\]

We prefer to use the Chi square divergence, because it measures the relative error between two probability laws.
Hence, the error committed on sets where the true distribution attributes small values is penalized in a similar way
to sets where the true distribution attributes large values. We also use the TVD because it has the property of
measuring the largest error committed when measuring a set $A$ using the estimated distribution instead of the true
one. The TVD can be directly calculated using the $L^1$ distance. Indeed, the Scheffé lemma (see Meister [2009] page
129) states that:

\[
\sup_{A\in\mathcal{B}(\mathbb{R})} \left| dP_\phi(A) - dP_{\phi^T}(A) \right| = \frac{1}{2}\int_{\mathbb{R}} \left| p_\phi(y) - p_{\phi^T}(y) \right| dy.
\]
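
The two error criteria can be evaluated by one-dimensional numerical integration. The following Python sketch is only an illustration: it uses a plain trapezoid rule instead of the R integration routines used in the study, and the pair of gaussian densities, the integration bounds and the function names are assumptions made for the example. For two unit-variance gaussians whose means differ by $\Delta$, the known closed forms $\chi^2 = e^{\Delta^2} - 1$ and $\mathrm{TVD} = \mathrm{erf}(\Delta / (2\sqrt{2}))$ serve as a check.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def error_criteria(estimated, true, lo, hi, steps=20000):
    """Return (chi2, tvd) between an estimated density and the true one.
    The TVD is computed as half the L1 distance (Scheffe lemma)."""
    dx = (hi - lo) / steps
    chi2 = 0.0
    l1 = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        diff = estimated(x) - true(x)
        chi2 += weight * diff * diff / true(x)
        l1 += weight * abs(diff)
    return chi2 * dx, 0.5 * l1 * dx


delta = 0.5  # illustrative shift between the two means (assumption)
chi2, tvd = error_criteria(
    lambda x: normal_pdf(x, delta, 1.0),
    lambda x: normal_pdf(x, 0.0, 1.0),
    lo=-12.0,
    hi=12.0,
)
chi2_exact = math.exp(delta ** 2) - 1.0
tvd_exact = math.erf(delta / (2.0 * math.sqrt(2.0)))
print("chi2:", chi2, "closed form:", chi2_exact)
print("TVD :", tvd, "closed form:", tvd_exact)
```

The division by the true density is what makes the Chi square criterion sensitive to errors in regions of low true density, whereas the TVD only accumulates absolute differences.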

We consider the Hellinger divergence for estimators based on divergences. The parameter vector is estimated
using six methods:

1. Maximum likelihood (MLE) which is calculated using EM for mixture models;

2. The classical MD$\varphi$DE defined by (6);

3. Our kernel-based MD$\varphi$DE defined by (8) with different choices for the kernel and its bandwidth;

4. The Basu-Lindsay approach with different choices for the kernel and its bandwidth;

5. The dual $\varphi$-divergence estimator (D$\varphi$DE) defined by (23) with escort parameter the result of our kernel-based
MD$\varphi$DE with the best choice of the kernel and window among presented possibilities;

6. The minimum power density estimator (MPD) of Basu et al. [1998] defined by (26) for $a \in \{0.1, 0.25, 0.5, 0.75, 1\}$.

We give for each experiment a summary of the results with comments, and specify the kernels used and the corresponding
window choices. We finally give an overall conclusion with some practical remarks.

Optimization was done using the Nelder-Mead algorithm. Numerical integration was done using the function distrExIntegrate
of package distrEx, which is a slight modification of the standard function integrate. It performs a Gauss-Legendre
quadrature when the function integrate returns an error. We have noticed that functions such as integral of package
pracma, although they perform well, are slow. Besides, the function int of package rmutil, which uses either
the Romberg method or algorithm 614 of the collected algorithms from ACM, seems to underestimate the value of
the integral in slightly difficult circumstances such as heavy-tailed distributions. For example, when we used it to
calculate the classical MD$\varphi$DE in the GPD case, it gave robust results because it underestimated the part of
the integral near infinity (forged thresholding effect). Finally, during some experiments on GPD observations and Weibull
distributions based on the Basu-Lindsay approach, the function distrExIntegrate failed to converge and the function integral
was used to obtain a result.

Our simulation study covers the following models: 

1. Gaussian model with unknown mean and variance; 

2. Two gaussian mixtures with two components where the proportion and the two means are unknown; 

3. Generalized Pareto distribution with unknown shape and scale; 

4. Three Weibull mixtures with two components where the proportion and the two shapes are unknown. 

Outliers were added to the original data in several ways which will be specified in each case. We either
added noise outside the support of the dataset or dispersed the noise over the whole dataset. We also used
different distributions to produce the noise.

For the first two models, we only used a gaussian kernel with window chosen using either Silverman's rule (nrd0
in the statistical tool R) or Sheather and Jones' rule (SJ). For the heavy-tailed models which are defined on half
the real line, we needed to use non-classical kernels such as asymmetric kernels (RIG: reciprocal inverse gaussian,
and GA: gamma kernels) and the varying KDE of Mnatsakanov and Sarkisian [2012], denoted here as MT (Mellin
transform) and defined above by (22), followed by the value of the bandwidth $a \in \{5,10,15,20\}$. In the GPD model
and the first Weibull mixture, we present a simple comparison between symmetric kernels and the other non-classical
methods and show the advantage of the latter in such a context. We therefore avoided using symmetric kernels for
the other Weibull mixtures. For the Basu-Lindsay approach, we did not implement asymmetric kernels, see the discussion
in paragraph 3.1. We only used the varying KDE.

Footnote: The function integral includes a variety of adaptive numerical integration methods such as Gauss-Kronrod quadrature, Romberg's
method, Gauss-Richardson quadrature, Clenshaw-Curtis (not adaptive) and (adaptive) Simpson's method.

As for the rule for deciding the window for the non-classical kernels, we tried out the cross-validation
method (CV), but it always resulted in large (small for the varying KDE) and inconvenient windows, especially when
outliers are inserted. We were therefore obliged to use fixed windows in order to obtain good results. For each
kernel and method, the window value or the rule used to calculate it is written next to it. More details can be found
in each paragraph.

7.1 Univariate gaussian model 

We consider the gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ when both parameters $\mu$ and $\sigma$ are unknown. We generate at each
run a 100-sample of the standard gaussian distribution $\mathcal{N}(0, 1)$. Outliers are added simply by replacing the 10 largest
values in the sample by the value 10.

The maximum likelihood estimators of the parameters are simply the empirical mean and variance $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{\mu})^2$. For methods which need kernels, we used a gaussian kernel with two rules for the window: Silverman's
rule and that of Sheather and Jones. We calculate the power density estimator (MPD) for values of the tradeoff
parameter $a \in \{0.1, 0.25, 0.5, 0.75, 1\}$. The D$\varphi$DE was calculated using the kernel-based MD$\varphi$DE as an escort with
Silverman's rule. Estimation results are summarized in table 2. Estimation errors are given in table 3. When
we are under the model, all compared methods give the same result with very slight differences. As we add 10%
outliers, the classical MD$\varphi$DE and the MLE give the same result, which deviates positively from the true mean
(about 0.83) with a large estimated standard deviation (about 3.16). This is already expected by virtue of the result of Broniatowski [2014]. Other methods,
ours included, give robust results except for MPD with $a = 0.1$. Our estimator (for both window choices) is at
the same level of efficiency as the MLE under the model. Besides, the window choice seems irrelevant for methods
based on kernels, except for Beran's method where Silverman's rule is a bit better. The MPD seems to give the best
tradeoff between efficiency and robustness for $a = 0.5$, outperforming other methods. The kernel-based MD$\varphi$DE and the
Basu-Lindsay approach give slightly better efficiency, which is traded for slightly lower robustness in comparison
to the result of MPD with $a = 0.5$.
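
The contamination scheme of this paragraph and its effect on the MLE can be reproduced in a few lines. The Python sketch below is illustrative only: the seed and the function names are arbitrary assumptions, and it covers a single run rather than the 100 runs summarized in the tables.

```python
import math
import random


def contaminate(sample, n_outliers=10, value=10.0):
    """Replace the n_outliers largest observations by a fixed value."""
    ordered = sorted(sample)
    return ordered[: len(ordered) - n_outliers] + [value] * n_outliers


def gaussian_mle(sample):
    """Empirical mean and (1/n) standard deviation, i.e. the gaussian MLE."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in sample) / n)
    return mu, sigma


random.seed(1)
clean = [random.gauss(0.0, 1.0) for _ in range(100)]
dirty = contaminate(clean)

mu_clean, sigma_clean = gaussian_mle(clean)
mu_dirty, sigma_dirty = gaussian_mle(dirty)
print("no outliers : mu = %.3f, sigma = %.3f" % (mu_clean, sigma_clean))
print("10%% outliers: mu = %.3f, sigma = %.3f" % (mu_dirty, sigma_dirty))
```

A single contaminated run already shows the shift of the mean and the inflation of the standard deviation that tables 2 and 3 report on average for the MLE.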


Estimation method            No outliers                          10% outliers
                             μ       sd(μ)   σ       sd(σ)        μ       sd(μ)   σ       sd(σ)
Hellinger
Classical MDφDE              0.005   0.111   0.983   0.082        0.833   0.103   3.157   0.039
New MDφDE - Silverman        0.005   0.113   0.967   0.081       -0.187   0.114   0.810   0.069
New MDφDE - SJ               0.005   0.113   0.973   0.082       -0.191   0.114   0.800   0.068
Basu-Lindsay - Silverman     0.005   0.114   0.968   0.081       -0.191   0.114   0.805   0.068
Basu-Lindsay - SJ            0.005   0.113   0.970   0.081       -0.193   0.114   0.799   0.067
Beran - Silverman            0.005   0.113   1.024   0.087       -0.191   0.114   0.878   0.075
Beran - SJ                   0.005   0.112   1.048   0.089       -0.192   0.114   0.853   0.073
MPD 0.1                      0.005   0.112   0.983   0.082        0.319   0.111   2.451   0.079
MPD 0.25                     0.006   0.112   0.983   0.083       -0.145   0.114   0.854   0.074
MPD 0.5                      0.008   0.117   0.979   0.087       -0.115   0.116   0.875   0.081
MPD 0.75                     0.010   0.123   0.975   0.093       -0.093   0.120   0.894   0.089
MPD 1                        0.012   0.129   0.971   0.098       -0.077   0.124   0.910   0.094
DφDE                         0.005   0.112   0.982   0.082       -0.164   0.114   0.873   0.080
MLE                          0.005   0.111   0.988   0.082        0.833   0.103   3.172   0.039

Table 2: The mean value and the standard deviation of the estimates in a 100-run experiment in the standard 
gaussian model. The divergence criterion is the Bellinger divergence. The escort parameter of the D(/jDE is taken as 
the new MDtpDE with the Silverman bandwidth choice. 

                                           No outliers                    10% outliers
Method                                χ²  sd(χ²)     TVD sd(TVD)      χ²  sd(χ²)     TVD sd(TVD)
Hellinger
Classical MDφDE                    0.104   0.052   0.054   0.026   8.503   0.113   0.516   0.002
New MDφDE - Silverman              0.106   0.052   0.056   0.028   0.230   0.063   0.136   0.041
New MDφDE - SJ                     0.105   0.052   0.055   0.027   0.239   0.062   0.141   0.041
Basu-Lindsay - Silverman           0.105   0.052   0.055   0.028   0.235   0.062   0.139   0.040
Basu-Lindsay - SJ                  0.105   0.052   0.055   0.027   0.240   0.062   0.142   0.040
Beran - Silverman                  0.114   0.063   0.054   0.025   0.191   0.067   0.110   0.042
Beran - SJ                         0.125   0.076   0.057   0.026   0.205   0.066   0.119   0.042
DφDE                               0.104   0.052   0.054   0.026   0.183   0.068   0.105   0.042
MPD 0.1                            0.104   0.051   0.053   0.026   5.772   0.356   0.411   0.013
MPD 0.25                           0.105   0.052   0.054   0.026   0.185   0.066   0.107   0.042
MPD 0.5                            0.110   0.054   0.057   0.028   0.165   0.068   0.094   0.042
MPD 0.75                           0.116   0.060   0.060   0.032   0.152   0.070   0.086   0.043
MPD 1                              0.121   0.066   0.063   0.036   0.144   0.070   0.080   0.043
MLE                                0.104   0.052   0.053   0.025   8.522   0.111   0.518   0.002

Table 3: The mean value of errors committed in a 100-run experiment with the standard deviation. The divergence
criterion is the Hellinger divergence. The escort parameter of the DφDE is taken as the new MDφDE with the
Silverman bandwidth choice.

7.2 Mixture of two gaussian components

We show in this paragraph several simulations from a two-component gaussian mixture where the data is contaminated
or not by 10% of outliers. We present two mixtures. The first one has the following parameters: λ = 0.35, μ1 = -2,
μ2 = 1.5. The second one has closer components (means). Its parameters are λ = 0.45, μ1 = -0.5, μ2 = 2. Variances
of both components are supposed to be fixed at 1. The two mixtures are plotted in figure 4. We are only
interested in the means and the proportion of each class. Contamination was done for the first mixture by adding
random observations from the uniform distribution U[-5,-2] to the 5 lowest values of the original sample. We
also added random observations from the uniform distribution U[2,5] to the 5 largest values. Estimation results are
summarized in table 4. Estimation error is calculated in table 5. For the second mixture, contamination was done by
adding random observations from the uniform distribution U[-3,-1] to the 5 lowest values of the original sample,
and random observations from the uniform distribution U[1,3] to the 5 largest values. Estimation results are
summarized in table 6. Estimation error is calculated in table 7. Maximum likelihood estimates are calculated using
the EM algorithm.
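
The contamination scheme described above can be sketched in a few lines of Python. This is only an illustration and not the code used for the experiments: the function names, the seed, and the sample size n = 100 (inferred from the 10% contamination level with 5 + 5 perturbed points) are assumptions made for this example.

```python
import random

def sample_mixture(n, lam, mu1, mu2, rng):
    """Draw n points from lam * N(mu1, 1) + (1 - lam) * N(mu2, 1)."""
    return [rng.gauss(mu1 if rng.random() < lam else mu2, 1.0) for _ in range(n)]

def contaminate(sample, low_range, high_range, k, rng):
    """Add uniform noise to the k lowest and the k largest values of the sample."""
    out = sorted(sample)
    for i in range(k):
        out[i] += rng.uniform(*low_range)        # push the lowest values further down
        out[-1 - i] += rng.uniform(*high_range)  # push the largest values further up
    return out

rng = random.Random(0)  # arbitrary seed, for reproducibility only
clean = sample_mixture(100, 0.35, -2.0, 1.5, rng)             # first mixture
dirty = contaminate(clean, (-5.0, -2.0), (2.0, 5.0), 5, rng)  # 10 perturbed points
```

For the second mixture, the same helper would be called with the parameters (0.45, -0.5, 2) and the ranges (-3, -1) and (1, 3).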



Figure 4: The two gaussian mixtures. 

Regarding the first mixture (table 5): under the model, all compared methods give the same
performance. When outliers are added, both the classical MDφDE and the MLE are not robust and give the same result.
The other methods provide robust results. The choice of the window has a clearer influence than in the gaussian case.
Silverman's rule gives better results for kernel-based approaches. Error values are close for robust methods, and
MPD 0.1 is the best one (unlike in the univariate gaussian case).

Regarding the second mixture: under the model, slight differences appear in favor of the
classical MDφDE and the MLE (calculated using EM). When we add the outliers, these two estimators fail. MPD
for a = 0.1, 0.25 and the Basu-Lindsay approach also fail when judged by the χ² distance. Our kernel-based MDφDE
has a robustness close to that of the remaining estimators: the MPD for a = 0.5 and Beran's method. The χ² error is more
sensitive and shows larger differences in favor of our approach against the Basu-Lindsay approach and the minimum
power density for small values of the tradeoff parameter. This was mainly because of some experiments which failed
to converge to a model where the two components are near 0 and considered the second component as the negative
noisy part of the data. Thus a large relative error occurred.

                                                   No outliers                                    10% outliers
Method                                 λ   sd(λ)      μ1  sd(μ1)      μ2  sd(μ2)       λ   sd(λ)      μ1  sd(μ1)      μ2  sd(μ2)
Hellinger
Classical MDφDE                    0.360   0.054  -1.989   0.204   1.493   0.136   0.342   0.064  -2.617   0.288   1.713   0.172
New MDφDE - Silverman              0.360   0.054  -1.993   0.208   1.499   0.133   0.349   0.058  -1.767   0.226   1.377   0.135
New MDφDE - SJ                     0.359   0.054  -1.981   0.206   1.490   0.134   0.346   0.059  -1.706   0.218   1.333   0.136
Basu-Lindsay - Silverman           0.361   0.055  -1.979   0.207   1.490   0.139   0.339   0.062  -1.927   0.305   1.377   0.158
Basu-Lindsay - SJ                  0.360   0.054  -1.977   0.203   1.486   0.135   0.346   0.059  -1.751   0.227   1.339   0.140
Beran - Silverman                  0.371   0.050  -1.985   0.203   1.546   0.132   0.369   0.053  -1.788   0.218   1.477   0.134
Beran - SJ                         0.366   0.052  -1.983   0.204   1.522   0.134   0.355   0.056  -1.743   0.217   1.384   0.136
DφDE                               0.361   0.054  -1.988   0.203   1.492   0.136   0.355   0.056  -2.132   0.224   1.605   0.137
MPD 0.1                            0.360   0.054  -1.991   0.207   1.493   0.134   0.346   0.059  -2.052   0.243   1.452   0.144
MPD 0.25                           0.360   0.053  -1.994   0.213   1.492   0.133   0.351   0.057  -1.832   0.223   1.394   0.134
MPD 0.5                            0.360   0.053  -1.997   0.226   1.489   0.136   0.353   0.056  -1.819   0.218   1.404   0.132
MLE (EM)                           0.360   0.054  -1.989   0.204   1.493   0.136   0.342   0.064  -2.617   0.288   1.713   0.172

Table 4: The mean value and the standard deviation of the estimates in a 100-run experiment in a two-component
gaussian mixture. The divergence criterion is the Hellinger divergence. The escort parameter of the DφDE is taken
as the new MDφDE with the Silverman bandwidth choice.


                                           No outliers                    10% outliers
Method                                χ²  sd(χ²)     TVD sd(TVD)      χ²  sd(χ²)     TVD sd(TVD)
Hellinger
Classical MDφDE                    0.113   0.044   0.064   0.025   0.335   0.102   0.150   0.034
New MDφDE - Silverman              0.113   0.045   0.064   0.025   0.155   0.059   0.087   0.033
New MDφDE - SJ                     0.113   0.045   0.064   0.025   0.179   0.061   0.101   0.035
Basu-Lindsay - Silverman           0.115   0.043   0.065   0.024   0.155   0.073   0.085   0.033
Basu-Lindsay - SJ                  0.113   0.043   0.064   0.024   0.170   0.062   0.096   0.035
Beran - Silverman                  0.113   0.046   0.064   0.025   0.132   0.050   0.073   0.027
Beran - SJ                         0.112   0.045   0.063   0.025   0.157   0.057   0.087   0.032
DφDE                               0.112   0.044   0.064   0.025   0.142   0.061   0.076   0.031
MPD 0.1                            0.113   0.044   0.064   0.025   0.124   0.052   0.069   0.029
MPD 0.25                           0.114   0.045   0.064   0.025   0.140   0.054   0.079   0.030
MPD 0.5                            0.117   0.047   0.065   0.025   0.138   0.053   0.078   0.030
MLE                                0.113   0.044   0.064   0.025   0.335   0.102   0.150   0.034

Table 5: The mean value of errors committed in a 100-run experiment with the standard deviation. The divergence
criterion is the Hellinger divergence. The escort parameter of the DφDE is taken as the new MDφDE with the
Silverman bandwidth choice.


                                                   No outliers                                    10% outliers
Method                                 λ   sd(λ)      μ1  sd(μ1)      μ2  sd(μ2)       λ   sd(λ)      μ1  sd(μ1)      μ2  sd(μ2)
Hellinger
Classical MDφDE                    0.457   0.077  -0.487   0.240   2.006   0.187   0.437   0.128  -0.860   0.478   2.192   0.343
New MDφDE - Silverman              0.457   0.077  -0.488   0.242   2.006   0.191   0.444   0.098  -0.409   0.376   1.873   0.240
New MDφDE - SJ                     0.456   0.077  -0.490   0.242   2.009   0.191   0.443   0.098  -0.381   0.376   1.851   0.235
Basu-Lindsay - Silverman           0.460   0.079  -0.470   0.247   2.004   0.189   0.406   0.150  -0.834   0.880    1.89   0.386
Basu-Lindsay - SJ                  0.460   0.078  -0.472   0.246   2.008   0.190   0.410   0.144  -0.762   0.888   1.857   0.352
Beran - Silverman                  0.464   0.066  -0.533   0.221   2.080   0.180   0.456   0.076  -0.494   0.233   2.012   0.225
Beran - SJ                         0.465   0.064  -0.541   0.213   2.096   0.178   0.453   0.080  -0.454   0.230   1.964   0.219
DφDE                               0.457   0.077  -0.487   0.239   2.006   0.187   0.447   0.086  -0.661   0.283   2.100   0.231
MPD 0.1                            0.456   0.077  -0.492   0.238   2.005   0.191   0.424   0.142  -0.843   0.872   2.015   0.504
MPD 0.25                           0.456   0.076  -0.497   0.236   2.003   0.199   0.441   0.097  -0.505   0.443   1.912   0.243
MPD 0.5                            0.455   0.076  -0.503   0.241   2.000   0.212   0.453   0.080  -0.394   0.234   1.906   0.205
MLE                                0.457   0.077  -0.487   0.240   2.006   0.187   0.432   0.146  -0.964   0.706   2.222   0.593

Table 6: The mean value and the standard deviation of the estimates in a 100-run experiment in a two-component
gaussian mixture with close means. The divergence criterion is the Hellinger divergence. The escort parameter of
the DφDE is taken as the new MDφDE with the Silverman bandwidth choice.

                                           No outliers                    10% outliers
Method                                χ²  sd(χ²)     TVD sd(TVD)      χ²  sd(χ²)     TVD sd(TVD)
Hellinger
Classical MDφDE                    0.108   0.050   0.061   0.029   0.294   0.528   0.122   0.044
New MDφDE - Silverman              0.110   0.051   0.062   0.029   0.156   0.245   0.081   0.044
New MDφDE - SJ                     0.109   0.052   0.062   0.029   0.163   0.242   0.085   0.042
Basu-Lindsay - Silverman           0.110   0.050   0.062   0.029   0.961   3.366   0.097   0.067
Basu-Lindsay - SJ                  0.110   0.050   0.063   0.029   0.982   3.606   0.092   0.067
Beran - Silverman                  0.113   0.050   0.062   0.027   0.114   0.053   0.065   0.031
Beran - SJ                         0.114   0.051   0.062   0.026   0.111   0.053   0.064   0.032
DφDE                               0.108   0.050   0.061   0.029   0.150   0.075   0.081   0.034
MPD 0.1                            0.108   0.050   0.062   0.029   2.745   10.73   0.090   0.067
MPD 0.25                           0.110   0.051   0.063   0.029   0.589   4.676   0.072   0.042
MPD 0.5                            0.114   0.052   0.065   0.030   0.121   0.059   0.072   0.037
MLE                                0.108   0.050   0.061   0.029   1.813    6.76   0.130   0.057

Table 7: The mean value of errors committed in a 100-run experiment with the standard deviation in a mixture
of two gaussian components with close means. The divergence criterion is the Hellinger divergence. The escort
parameter of the DφDE is taken as the new MDφDE with the Silverman bandwidth choice.

7.3 Generalized Pareto distribution

We show in this paragraph several simulations from the generalized Pareto distribution (GPD) where the data is
contaminated or not by 10% of outliers. A GPD with a fixed location at zero, a scale parameter σ > 0 and a shape
parameter ν > 0 is defined by:

$$p_{\nu,\sigma}(y) = \frac{1}{\sigma}\left(1 + \nu\,\frac{y}{\sigma}\right)^{-1-\frac{1}{\nu}}, \quad \text{for } y \geq 0.$$

We generate 100 samples. Each sample contains 100 observations drawn independently from the same distribution
GPD(ν = 0.7, σ = 3). Outliers are added by replacing 10 observations (chosen randomly) from each sample by
observations from the distribution GPD(ν = 1, σ = 10, μ = 500), where μ is the location parameter. Estimation
results are summarized in table 8. Estimation error is calculated in table 9. The maximum likelihood estimator was
calculated using the gpd.fit function of package ismev.
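
As an illustration, the data-generating scheme of this paragraph can be sketched in Python by inverse-transform sampling. This is a hedged sketch and not the code behind the reported results: the function names and the seed are assumptions, and the quantile function is the standard inverse of the GPD distribution function with shape ν and scale σ.

```python
import random

def gpd_pdf(y, nu, sigma):
    """Density of the GPD with location 0, shape nu > 0 and scale sigma > 0."""
    if y < 0:
        return 0.0
    return (1.0 / sigma) * (1.0 + nu * y / sigma) ** (-1.0 - 1.0 / nu)

def gpd_quantile(u, nu, sigma, loc=0.0):
    """Quantile function, obtained by inverting the GPD distribution function."""
    return loc + (sigma / nu) * ((1.0 - u) ** (-nu) - 1.0)

def gpd_sample(n, nu, sigma, rng, loc=0.0):
    """Inverse-transform sampling: uniform draws pushed through the quantile."""
    return [gpd_quantile(rng.random(), nu, sigma, loc) for _ in range(n)]

def replace_with_outliers(sample, k, rng):
    """Replace k randomly chosen observations by draws from GPD(1, 10, loc=500)."""
    out = list(sample)
    positions = rng.sample(range(len(out)), k)
    for pos, value in zip(positions, gpd_sample(k, 1.0, 10.0, rng, loc=500.0)):
        out[pos] = value
    return out

rng = random.Random(1)  # arbitrary seed
clean = gpd_sample(100, 0.7, 3.0, rng)
dirty = replace_with_outliers(clean, 10, rng)
```

Repeating the last two lines 100 times would give the 100 clean and contaminated samples of the experiment.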

In the literature on nonparametric density estimation, it is widely noted that symmetric kernels are not
suitable for densities defined on half the real line because of the boundary effect. We nevertheless use them here
for the sake of comparison when they are employed inside an estimation criterion. For more details, we invite the
reader to revisit paragraph 3.1.

Under the model, all presented methods except the Basu-Lindsay approach attain the same
efficiency as the MLE, and are sometimes even better for given choices of the kernel or the tradeoff parameter. Our
kernel-based MDφDE attained a performance similar to the MLE for all non classical kernels and the corresponding
choices of the window. Beran's method attained this performance only with the varying KDE (MT 5, 10, 15, 20). MPD
attained this level only for small values of a (0.25 and 0.1). Other kernel choices were not very successful, except
for our kernel-based MDφDE with a gaussian kernel and Silverman's rule. This may be some indication of a low
sensitivity to the kernel used.

When outliers are added, the performance of kernel-based methods deteriorates slightly, whereas other methods are
greatly influenced and their error is at least doubled, MPD included in all cases. The use of asymmetric kernels seems
to be the most convenient for a GPD model. Our kernel-based MDφDE seems to give the best result (in χ² and
TVD) for all kernels and corresponding windows, keeping a large margin in its favor in comparison with other methods.

Why does the Basu-Lindsay approach give bad results in a GPD model using a gaussian kernel? A natural answer
is that the gaussian kernel is not suited for densities which do not go to zero at both extremities of the domain of
definition of the true distribution, as was already indicated in Sect. 3. It is well known that symmetric kernels suffer
from the so-called boundary effect or bias. In the Basu-Lindsay approach, this fact has a double bad effect. The first
is on the kernel estimator, which is no longer appropriate to replace the true distribution near zero. The second is
on the smoothed model. When the model is smoothed with a gaussian kernel, a great loss of information occurs
in comparison to the original model, see Fig. 3. Now that both the kernel estimator and the smoothed model are
"corrupted", the divergence between them is no longer related to the divergence between the model and the empirical
distribution. The use of asymmetric kernels or bias-correction methods was not possible in practice because these
methods provide non normalized estimators, see paragraph 3.1 Remark 5. This causes further difficulties in numerical
integration while smoothing the model, and requires a prohibitive execution time. We therefore used the non
classical kernel estimator based on the Mellin transform defined by (22). This estimator is normalized by construction
and is free of boundary bias. Results based on such an estimator are a clear improvement.
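
The boundary effect mentioned above is easy to reproduce numerically. The following Python sketch is only an illustration under stated assumptions: it uses exponential data as a simple stand-in for a density that does not vanish at zero (not the GPD of the experiments), a normal-reference bandwidth 1.06 sd n^(-1/5) rather than the exact rules compared in the paper, and an arbitrary seed and sample size.

```python
import math
import random
import statistics

def gaussian_kde_at(x, data, h):
    """Gaussian kernel density estimate at the point x with bandwidth h."""
    c = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)

rng = random.Random(2)                                # arbitrary seed
data = [rng.expovariate(1.0) for _ in range(1000)]    # true density at 0 equals 1
h = 1.06 * statistics.stdev(data) * len(data) ** (-0.2)  # normal-reference bandwidth
estimate_at_zero = gaussian_kde_at(0.0, data, h)
print(estimate_at_zero)
```

The printed estimate falls well below the true value 1, because a large part of the kernel mass placed around observations near zero lies on the negative half-line, where the true density is zero.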

Last but not least, it is worth noting that both asymmetric kernels gave very close results for all kernel-based methods.
In the remaining experiments, we will only use the reciprocal inverse gaussian (RIG) kernel.

Remark The nature of the heavy tail of the GPD (slow decrease at infinity) made numerical integration difficult, and
some integration functions failed to give fairly correct results. Therefore, in order to avoid integrating on the
infinite interval [0,∞), we propose to use a quantile trick, which amounts to the change of variable:

$$\int_0^{\infty} f(x)\,dx = \int_0^{1} \frac{f\left(F^{-1}(y)\right)}{p_{\nu,\sigma}\left(F^{-1}(y)\right)}\,dy,$$

where $F^{-1}(y) = \frac{\sigma}{\nu}\left((1-y)^{-\nu} - 1\right)$ is the quantile function of the GPD probability law $P_{\nu,\sigma}$. Although this idea may appear
ineffective since it does not change anything in the integral (the quantile function maps values from [0,1) back into
[0,∞)), it saved us from relying on other integration functions, such as the function int, which run but largely
underestimate the true value; see the discussion at the beginning of this section. In fact, integration methods in
general perform better when integrating on a finite interval than when integrating on an infinite one.
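
A minimal Python sketch of this quantile trick is given below. It is an illustration only: the midpoint rule, the number of nodes and the function names are assumptions of this example and not the integration routine used for the reported results.

```python
import math

def gpd_pdf(y, nu, sigma):
    """GPD density with location 0, shape nu and scale sigma."""
    return (1.0 / sigma) * (1.0 + nu * y / sigma) ** (-1.0 - 1.0 / nu) if y >= 0 else 0.0

def gpd_quantile(u, nu, sigma):
    """GPD quantile function F^{-1}(u) for u in [0, 1)."""
    return (sigma / nu) * ((1.0 - u) ** (-nu) - 1.0)

def integrate_by_quantile(g, nu, sigma, n=20000):
    """Midpoint rule on (0, 1) for the integral of g over [0, inf),
    after the substitution x = F^{-1}(u) with F the GPD(nu, sigma) cdf."""
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        x = gpd_quantile(u, nu, sigma)
        total += g(x) / gpd_pdf(x, nu, sigma)   # dx = du / p(F^{-1}(u))
    return total / n

# Total mass of another GPD density, integrated through the GPD(0.7, 3) quantile.
mass = integrate_by_quantile(lambda x: gpd_pdf(x, 0.5, 2.0), 0.7, 3.0)
# Hellinger affinity between GPD(0.5, 2) and GPD(0.7, 3).
affinity = integrate_by_quantile(
    lambda x: math.sqrt(gpd_pdf(x, 0.5, 2.0) * gpd_pdf(x, 0.7, 3.0)), 0.7, 3.0)
```

The transformed integrands stay bounded on (0, 1) here because the reference density has the heavier tail; with a lighter-tailed reference the ratio could blow up near 1, so the choice of the reference law matters.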

                                           No outliers                    10% outliers
Method                                 ν   sd(ν)       σ   sd(σ)       ν   sd(ν)       σ   sd(σ)
Hellinger
Classical MDφDE                    0.721   0.174   3.029   0.575   1.655   0.113   2.694   0.491
New MDφDE - Gauss Silverman        0.463   0.142   2.719   0.586   0.571   0.197   2.427   0.599
New MDφDE - Gauss SJ               0.343   0.108   2.858   0.597   0.368   0.141   2.798   0.569
New MDφDE - RIG CV                 0.528   0.140   3.125   0.611   0.775   0.202   2.844   0.571
New MDφDE - RIG Nrd0               0.562   0.139   3.133   0.605   0.817   0.219   2.815   0.545
New MDφDE - RIG SJ                 0.522   0.129   3.138   0.616   0.688   0.191   2.903   0.574
New MDφDE - GA CV                  0.530   0.139   3.117   0.610   0.766   0.204   2.833   0.577
New MDφDE - GA Nrd0                0.564   0.139   3.112   0.601   0.814   0.211   2.787   0.544
New MDφDE - GA SJ                  0.520   0.126   3.135   0.607   0.691   0.185   2.895   0.576
New MDφDE - MT 5                   0.641   0.156   3.217   0.615   1.202   0.161   2.806   0.510
New MDφDE - MT 10                  0.607   0.153   3.272   0.628   1.090   0.195   2.876   0.552
New MDφDE - MT 15                  0.588   0.150   3.307   0.636   1.026   0.206   2.920   0.565
New MDφDE - MT 20                  0.573   0.148   3.331   0.643   0.979   0.212   2.956   0.577
Basu-Lindsay - Gauss Silverman     0.128   0.125   6.022   1.522   0.122   0.109   7.151   2.025
Basu-Lindsay - Gauss SJ            0.078   0.066   4.603   1.057   0.097   0.087   4.843   1.316
Basu-Lindsay - MT 5                0.833   0.156   2.232   0.651   0.765   0.189   2.937   0.666
Basu-Lindsay - MT 10               0.853   0.197   2.297   0.659   0.777   0.193   2.880   0.704
Basu-Lindsay - MT 15               0.881   0.176   2.293   0.517   1.164   0.169   2.893   0.530
Basu-Lindsay - MT 20               0.907   0.180   2.337   0.603   0.936   0.206   2.694   0.580
Beran - Gauss Nrd0                 0.216   0.108   5.165   1.218   0.197   0.125   6.084   1.546
Beran - Gauss SJ                   0.231   0.108   3.988   0.919   0.229   0.134   4.135   0.939
Beran - RIG CV                     0.516   0.134   3.890   0.832   0.833   0.218   3.944   0.745
Beran - RIG Nrd0                   0.515   0.138   4.441   1.026   0.878   0.233   4.229   0.954
Beran - RIG SJ                     0.507   0.136   3.813   0.787   0.732   0.200   3.641   1.113
Beran - GA CV                      0.486   0.134   3.936   0.847   0.745   0.207   4.097   0.822
Beran - GA Nrd0                    0.475   0.139   4.510   0.998   0.778   0.220   4.547   1.032
Beran - GA SJ                      0.503   0.133   3.780   0.773   0.703   0.186   3.589   0.781
Beran - MT 5                       0.711   0.150   3.384   0.640   1.339   0.140   2.979   0.551
Beran - MT 10                      0.665   0.150   3.315   0.620   1.231   0.155   2.900   0.530
Beran - MT 15                      0.637   0.154   3.310   0.640   1.164   0.169   2.893   0.530
Beran - MT 20                      0.627   0.156   3.302   0.637   0.936   0.206   2.694   0.580
DφDE                               0.720   0.179   3.026   0.580    1.45   0.290   2.749   0.524
MPD 1                              0.729   0.402   3.023   0.660   1.039   0.483   3.273   0.681
MPD 0.75                           0.716   0.331   3.025   0.631   1.021   0.416   3.242   0.645
MPD 0.5                            0.715   0.263   3.023   0.603   1.028   0.361   3.171   0.605
MPD 0.25                           0.722   0.200   3.019   0.581   1.292   0.240   2.955   0.532
MPD 0.1                            0.723   0.175   3.019   0.568   1.564   0.154   2.779   0.500
MLE                                0.719   0.174   3.031    0.58   1.654   0.113   2.695   0.492

Table 8: The mean value and the standard deviation of the estimates in a 100-run experiment in the GPD model.
The divergence criterion is the Neyman chi square divergence or the Hellinger. The escort parameter of the DφDE
is taken as the new MDφDE with the Silverman bandwidth choice.






                                           No outliers                    10% outliers
Method                                χ²  sd(χ²)     TVD sd(TVD)      χ²  sd(χ²)     TVD sd(TVD)
Hellinger
Classical MDφDE                    0.099   0.077   0.044   0.026   1.027   0.195   0.142   0.014
New MDφDE - Silverman              0.159   0.056   0.087   0.034   0.171   0.070   0.097   0.044
New MDφDE - SJ                     0.189   0.052   0.100   0.035   0.183   0.066   0.098   0.042
New MDφDE - RIG CV                 0.109   0.045   0.058   0.027   0.114   0.065   0.053   0.029
New MDφDE - RIG Nrd0               0.100   0.044   0.054   0.027   0.142   0.130   0.056   0.029
New MDφDE - RIG SJ                 0.110   0.044   0.059   0.027   0.104   0.056   0.054   0.030
New MDφDE - GA CV                  0.108   0.045   0.058   0.027   0.114   0.063   0.054   0.029
New MDφDE - GA Nrd0                0.100   0.044   0.054   0.027   0.132   0.092   0.056   0.028
New MDφDE - GA SJ                  0.109   0.044   0.058   0.027   0.104   0.056   0.054   0.030
New MDφDE - MT 5                   0.093   0.053   0.049   0.028   0.472   0.307   0.089   0.024
New MDφDE - MT 10                  0.095   0.050   0.051   0.028   0.336   0.243   0.078   0.026
New MDφDE - MT 15                  0.097   0.048   0.053   0.028   0.268   0.193   0.072   0.027
New MDφDE - MT 20                  0.099   0.047   0.054   0.029   0.226   0.154   0.068   0.028
Basu-Lindsay - Silverman           0.301    0.08   0.179   0.048   0.361   0.110   0.214   0.061
Basu-Lindsay - SJ                  0.256   0.046   0.145   0.033   0.264   0.055   0.151   0.039
Basu-Lindsay - MT 5                0.155   0.082   0.090   0.047   0.100   0.077   0.051   0.036
Basu-Lindsay - MT 10               0.155   0.080   0.085   0.043   0.102   0.078   0.053   0.038
Basu-Lindsay - MT 15               0.140   0.107   0.071   0.050   0.421   0.278   0.086   0.025
Basu-Lindsay - MT 20               0.157   0.085   0.078   0.044   0.160   0.083   0.059   0.031
Beran - Gauss Nrd0                 0.241   0.072   0.142   0.045   0.297   0.090   0.177   0.053
Beran - Gauss SJ                   0.199   0.049   0.109   0.034   0.207   0.044   0.114   0.032
Beran - RIG CV                     0.133   0.060   0.076   0.038   0.226   0.128   0.094   0.041
Beran - RIG Nrd0                   0.164   0.085   0.097   0.051   0.306   0.235   0.114   0.054
Beran - RIG SJ                     0.123   0.060   0.069   0.039   0.146   0.097   0.070   0.048
Beran - GA CV                      0.136   0.060   0.078   0.038   0.195   0.100   0.094   0.044
Beran - GA Nrd0                    0.169   0.078   0.101   0.048   0.267   0.186   0.121   0.057
Beran - GA SJ                      0.120   0.058   0.068   0.037   0.130   0.078   0.065   0.040
Beran - MT 5                       0.103   0.067   0.052   0.030   0.915   0.729   0.111   0.022
Beran - MT 10                      0.093   0.057   0.049   0.029   0.581   0.615   0.095   0.023
Beran - MT 15                      0.094   0.054   0.050   0.029   0.421   0.278   0.086   0.025
Beran - MT 20                      0.095   0.055   0.051   0.029   0.371   0.298   0.081   0.026
DφDE                               0.099   0.077   0.048   0.028   0.843   0.407   0.120   0.030
MPD 1                              0.211   0.310   0.068   0.038   0.477   0.665   0.089   0.047
MPD 0.75                           0.204   0.389   0.062   0.034   0.424   0.545   0.085   0.043
MPD 0.5                            0.141   0.160   0.056   0.030   0.419   0.515   0.082   0.039
MPD 0.25                           0.106   0.082   0.049   0.028   0.669   0.441   0.104   0.030
MPD 0.1                            0.099   0.083   0.047   0.027   0.955   0.326   0.133   0.019
MLE                                0.099   0.077   0.048   0.026   1.025   0.195   0.142   0.014

Table 9: The mean value of errors committed in a 100-run experiment with the standard deviation for the GPD
model. The divergence criterion is the Neyman chi square divergence or the Hellinger. The escort parameter of
the DφDE is taken as the new MDφDE with the gamma kernel.

7.4 Mixtures of Two Weibull Components

We present the results of estimating three different two-component Weibull mixtures. The model has the following
density:

$$p_{\phi}(x) = \lambda\,\frac{\nu_1}{\sigma_1}\left(\frac{x}{\sigma_1}\right)^{\nu_1-1} e^{-\left(\frac{x}{\sigma_1}\right)^{\nu_1}} + (1-\lambda)\,\frac{\nu_2}{\sigma_2}\left(\frac{x}{\sigma_2}\right)^{\nu_2-1} e^{-\left(\frac{x}{\sigma_2}\right)^{\nu_2}}.$$

Scale parameters are supposed to be known and equal to σ1 = 0.5 for the first component and σ2 = 2 for the second component.
The proportion λ is supposed unknown, and its true value is fixed at 0.35. Shape parameters are supposed unknown. Our
examples cover a variety of cases of a Weibull mixture where the density function has either a finite limit at zero or
goes to infinity for one of the components:

1. a mixture with close modes ν1 = 1.2, ν2 = 2;

2. a mixture with one mode and with limit equal to infinity at zero ν1 = 0.5, ν2 = 3;

3. a mixture with no modes and with limit equal to infinity at zero ν1 = 0.5, ν2 = 1.

We plot these mixtures in figure 5. Outliers were added in different ways to illustrate several scenarios. For the first
mixture, outliers were added by replacing 10 randomly chosen observations of each sample by 10 observations drawn
independently from a Weibull distribution with shape ν = 0.9 and scale σ = 3. See tables 10 and 11. For the
second mixture, we added to each of the 10 largest observations of each sample a random observation drawn from the
uniform distribution U[2, 10]. See tables 12 and 13. For the third one, outliers were added by replacing 10 observations,
chosen randomly, of each sample by observations from the uniform distribution U[max_i y_i, 75], after having verified that no
observation in the overall data exceeded the value 50. See tables 14 and 15.
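
The first Weibull scenario can be sketched in Python as follows. This is a hedged illustration and not the code of the experiments: the function names, the seed, and the sample size n = 100 (inferred from the replacement of 10 observations at a 10% contamination level) are assumptions.

```python
import math
import random

SCALES = (0.5, 2.0)  # known scale parameters of the two components

def weibull_pdf(x, shape, scale):
    """Weibull density; returns 0 for x <= 0 (it may be infinite at 0 if shape < 1)."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))

def mixture_pdf(x, lam, nu1, nu2):
    """Two-component Weibull mixture density with the known scales above."""
    return (lam * weibull_pdf(x, nu1, SCALES[0])
            + (1.0 - lam) * weibull_pdf(x, nu2, SCALES[1]))

def weibull_draw(shape, scale, rng):
    """Inverse-transform draw: scale * (-log(1 - U)) ** (1 / shape)."""
    return scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)

def mixture_sample(n, lam, nu1, nu2, rng):
    return [weibull_draw(nu1, SCALES[0], rng) if rng.random() < lam
            else weibull_draw(nu2, SCALES[1], rng) for _ in range(n)]

def contaminate_first_mixture(sample, k, rng):
    """Replace k randomly chosen observations by Weibull(shape 0.9, scale 3) draws."""
    out = list(sample)
    for pos in rng.sample(range(len(out)), k):
        out[pos] = weibull_draw(0.9, 3.0, rng)
    return out

rng = random.Random(3)  # arbitrary seed
clean = mixture_sample(100, 0.35, 1.2, 2.0, rng)
dirty = contaminate_first_mixture(clean, 10, rng)
```

The two other scenarios would only change the contamination helper (uniform noise added to the 10 largest values, or uniform replacements above the sample maximum).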



Figure 5: The three Weibull mixtures used in our experiments.

The calculation of the χ² divergence between the estimated model and the true distribution often gave infinity on all
mixtures for all estimation methods, even under the model. This is because a small bias in the estimation of the shape
parameter results in a large relative error both in the tail and near zero. We therefore only provide the
TVD as an error criterion.
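
As an illustration of this error criterion, the Python sketch below approximates the total variation distance between two densities on the positive half-line. It assumes the usual normalization TVD(p, q) = (1/2) ∫ |p − q|, which may differ from the convention adopted earlier in the paper; the truncation point, the midpoint rule and the function names are also assumptions of this example.

```python
import math

def weibull_pdf(x, shape, scale):
    """Weibull density; returns 0 for x <= 0."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))

def total_variation(p, q, upper, n=100000):
    """Midpoint approximation of 0.5 * integral over [0, upper] of |p - q|."""
    h = upper / n
    return 0.5 * h * sum(abs(p((i + 0.5) * h) - q((i + 0.5) * h)) for i in range(n))

# Check on a case with a known answer: two Weibull laws with shape 1
# (exponential laws) and scales 1 and 2.
p = lambda x: weibull_pdf(x, 1.0, 1.0)
q = lambda x: weibull_pdf(x, 1.0, 2.0)
tvd = total_variation(p, q, upper=60.0)
```

For this pair the two densities cross once, at x = 2 log 2, and the distance has the closed-form value 0.25, which the approximation reproduces closely. For densities that are infinite at zero (shape below 1), a plain grid such as this one is less reliable and a change of variable would be preferable.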

The first Weibull mixture was the least complicated case. We were able to get satisfactory results for our kernel-based
MDφDE using a gaussian kernel. The two other mixtures were more challenging, and we needed to use asymmetric
kernels to solve the problem of the bias near zero. It is worth noting that the Basu-Lindsay approach provided very
bad estimates in the three mixtures, which keeps it out of the competition. Note also that the use of a gaussian kernel
gave very good results for the first mixture in spite of the boundary bias. We excluded it for the mixtures which
have an infinite limit at zero, because it did not work well due to the large bias at zero.

For the first mixture, under the model all presented methods provide results close to (and sometimes better than) the
MLE, except for the Basu-Lindsay approach with all available choices and Beran's method with the varying KDE
(MT) for windows 5 and 10, which fail. Under contamination, our method gives better results than all other methods
and has a performance very close to (even slightly better than) the MPD for a tradeoff parameter higher than 0.25.

For the second mixture, the Basu-Lindsay approach failed again. Beran's method gave a good result under the model
only in one case: the RIG with window 0.01. The density power MPD worked very well only for a tradeoff parameter
lower than 0.5 and gave a good compromise between robustness and efficiency. It gave the best compromise among the
presented methods. Our kernel-based MDφDE has results close to the MPD, with a difference of 0.01 in the TVD. It is
worth noting that our kernel-based MDφDE gave fair results for the two proposed kernels: the asymmetric kernel
RIG for window 0.01 as before, and the varying KDE MT for windows 10, 15 and 20. This was not the case
for other kernel-based methods, which shows again a lower sensitivity to the kernel.

For the third mixture, the Basu-Lindsay approach did not give good results, especially under the model. The only
satisfactory results (which gave a good tradeoff between robustness and efficiency) were obtained by our kernel-based
MDφDE for the RIG kernel with window 0.01, Beran's method with the same kernel and window, and the MPD
for a = 0.5. Our method and Beran's gave the same result, with a difference of 0.015 in favor of the power density
estimator. Better efficiency was obtained by other choices, but at the cost of the robustness of the resulting estimator
under contamination.


                                                   No outliers                                    10% outliers
Method                                 λ   sd(λ)      ν1  sd(ν1)      ν2  sd(ν2)       λ   sd(λ)      ν1  sd(ν1)      ν2  sd(ν2)
Hellinger
Classical MDφDE                    0.355   0.066   1.245   0.228   2.054   0.237   0.410   0.257   1.045   0.255   1.718   0.849
New MDφDE - Gauss Silverman        0.384   0.067   1.221   0.244   2.138   0.291   0.348   0.076   1.121   0.265   1.822   0.319
New MDφDE - Gauss SJ               0.387   0.067   1.227   0.240   2.188   0.308   0.356   0.076   1.133   0.261   1.905   0.319
New MDφDE - RIG 0.01               0.371   0.066   1.297   0.231   2.215   0.321   0.355   0.100   1.213   0.229   1.955   0.344
New MDφDE - RIG 0.1                0.358   0.065   1.233   0.210   2.065   0.267   0.330   0.117   1.127   0.226   1.741   0.304
New MDφDE - RIG SJ                 0.351   0.066   1.217   0.207   2.001   0.245   0.324   0.132   1.107   0.226   1.670   0.297
New MDφDE - MT 5                   0.328   0.112   1.301   0.235   1.809   0.192   0.363   0.229   1.195   0.213   1.592   0.356
New MDφDE - MT 10                  0.330   0.091   1.355   0.235   1.923   0.220   0.351   0.204   1.247   0.230   1.645   0.285
New MDφDE - MT 15                  0.327   0.076   1.383   0.234   1.973   0.237   0.348   0.199   1.275   0.233   1.680   0.294
New MDφDE - MT 20                  0.328   0.076   1.403   0.233   2.002   0.249   0.348   0.198   1.295   0.235   1.702   0.297
Basu-Lindsay - Gauss Silverman     0.752   0.064   2.199   0.248   38.66    8.66   0.822   0.083   1.927   0.276   32.37   13.52
Basu-Lindsay - Gauss SJ            0.723   0.059   2.205   0.257   16.18   10.75   0.759   0.065   1.958   0.263   19.52   10.56
Basu-Lindsay - MT 5                0.403   0.072   1.339   0.224   3.241   0.547   0.346   0.076   1.260   0.210   2.874   0.338
Basu-Lindsay - MT 10               0.390   0.069   1.409   0.234   3.281   0.465   0.337   0.067   1.319   0.217   2.813   0.233
Basu-Lindsay - MT 15               0.393   0.067   1.458   0.248   3.297   0.476   0.333   0.062   1.340   0.232   2.823   0.257
Basu-Lindsay - MT 20               0.399   0.066   1.472   0.221   3.282   0.458   0.335   0.068   1.362   0.225   2.819   0.300
Beran - Gauss Silverman            0.254   0.058   1.313   0.087   2.010   0.200   0.182   0.074   1.174   0.162   1.703   0.253
Beran - Gauss SJ                   0.295   0.067   1.371   0.104   2.085   0.225   0.240   0.079   1.284   0.127   1.794   0.266
Beran - RIG 0.01                   0.368   0.064   1.240   0.198   2.147   0.277   0.339   0.094   1.151   0.200   1.858   0.332
Beran - RIG 0.1                    0.345   0.061   1.117   0.103   1.897   0.172   0.289   0.095   1.033   0.125   1.570   0.247
Beran - RIG SJ                     0.320   0.060   1.069   0.074   1.725   0.138   0.260   0.123   0.997   0.088   1.416   0.203
Beran - MT 5                       0.453   0.307   1.146   0.178   1.386   0.180   0.626   0.349   1.055   0.172   1.461   0.531
Beran - MT 10                      0.354   0.201   1.238   0.201   1.553   0.133   0.419   0.304   1.134   0.202   1.450   0.425
Beran - MT 15                      0.334   0.153   1.286   0.211   1.664   0.143   0.404   0.277   1.178   0.188   1.500   0.370
Beran - MT 20                      0.334   0.136   1.317   0.218   1.738   0.156   0.383   0.256   1.207   0.198   1.542   0.348
DφDE                               0.356   0.066   1.248   0.232   2.069   0.278   0.332   0.142   1.113   0.248   1.700   0.289
MPD 1                              0.358   0.087   1.238   0.252   2.127   0.521   0.343   0.113   1.167   0.239   2.005   0.517
MPD 0.75                           0.353   0.073   1.236   0.237   2.088   0.397   0.341   0.108   1.164   0.235   1.951   0.432
MPD 0.5                            0.354   0.068   1.238   0.230   2.071   0.345   0.336   0.105   1.159   0.237   1.860   0.344
MPD 0.25                           0.354   0.066   1.239   0.226   2.053   0.272   0.324   0.131   1.132   0.235   1.699   0.321
MPD 0.1                            0.355   0.066   1.242   0.227   2.048   0.238   0.394   0.241   1.091   0.215   1.780   0.792
MLE (EM)                           0.355   0.066   1.245   0.228   2.054   0.237   0.321   0.187   0.913   0.313   1.575   0.325

Table 10: The mean value and the standard deviation of the estimates in a 100-run experiment on a two-component
Weibull mixture (λ = 0.35, ν1 = 1.2, ν2 = 2). The escort parameter of the DφDE is taken as the new MDφDE with
the SJ bandwidth choice.
Estimation method               No outliers               10% outliers
                                mean    median  sd        mean    median  sd
Hellinger
Classical MDφDE                 0.052   0.048   0.025     0.108   0.094   0.099
New MDφDE - Gauss Silverman     0.058   0.054   0.029     0.068   0.065   0.034
New MDφDE - Gauss SJ            0.058   0.053   0.029     0.064   0.061   0.031
New MDφDE - RIG 0.01            0.058   0.052   0.030     0.059   0.057   0.030
New MDφDE - RIG 0.1             0.051   0.049   0.026     0.066   0.062   0.032
New MDφDE - RIG SJ              0.050   0.050   0.026     0.071   0.066   0.032
New MDφDE - MT 5                0.057   0.055   0.025     0.081   0.074   0.032
New MDφDE - MT 10               0.054   0.053   0.026     0.075   0.071   0.032
New MDφDE - MT 15               0.054   0.054   0.026     0.073   0.069   0.032
New MDφDE - MT 20               0.055   0.054   0.027     0.073   0.069   0.031
Basu Lindsay - Gauss Silverman  0.298   0.289   0.042     0.247   0.253   0.050
Basu Lindsay - Gauss SJ         0.252   0.256   0.051     0.242   0.246   0.044
Basu Lindsay - MT 5             0.127   0.141   0.046     0.121   0.111   0.042
Basu Lindsay - MT 10            0.133   0.136   0.039     0.117   0.111   0.036
Basu Lindsay - MT 15            0.134   0.141   0.039     0.118   0.110   0.038
Basu Lindsay - MT 20            0.132   0.138   0.039     0.117   0.109   0.039
Beran - Gauss Silverman         0.068   0.062   0.028     0.082   0.081   0.031
Beran - Gauss SJ                0.060   0.054   0.028     0.067   0.065   0.029
Beran - RIG 0.01                0.052   0.048   0.026     0.060   0.058   0.029
Beran - RIG 0.1                 0.042   0.039   0.020     0.067   0.061   0.030
Beran - RIG SJ                  0.045   0.044   0.017     0.079   0.076   0.030
Beran - MT 5                    0.099   0.097   0.016     0.125   0.125   0.022
Beran - MT 10                   0.073   0.070   0.021     0.102   0.100   0.028
Beran - MT 15                   0.064   0.060   0.022     0.092   0.089   0.030
Beran - MT 20                   0.059   0.055   0.023     0.086   0.084   0.030
DφDE                            0.053   0.049   0.027     0.068   0.065   0.031
MPD 1                           0.065   0.061   0.034     0.068   0.064   0.030
MPD 0.75                        0.059   0.056   0.029     0.063   0.060   0.029
MPD 0.5                         0.056   0.052   0.029     0.061   0.056   0.029
MPD 0.25                        0.052   0.048   0.027     0.068   0.067   0.031
MPD 0.1                         0.051   0.048   0.026     0.088   0.083   0.039
MLE                             0.052   0.048   0.025     0.095   0.098   0.035

Table 11: The mean value with the standard deviation of the TVA committed in a 100-run experiment on a two-component
Weibull mixture (λ = 0.35, ν1 = 1.2, ν2 = 2). The escort parameter of the DφDE is taken as the new
MDφDE with the SJ bandwidth choice.
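
As an illustration of how an error of this kind can be evaluated, the following minimal sketch computes a distance between the true mixture and an estimated one. It assumes that the reported error is the total variation distance, taken here as half the L1 distance between the two densities, and that both Weibull components have unit scale; neither assumption is stated in this section. The function names, the integration bound and the number of steps are hypothetical choices made for the example.

```python
import math

def weibull_pdf(x, shape, scale=1.0):
    """Weibull density. The unit default scale is an assumption of this sketch."""
    if x <= 0.0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))

def mixture_pdf(x, lam, nu1, nu2):
    """Two-component Weibull mixture with proportion lam and shapes nu1, nu2."""
    return lam * weibull_pdf(x, nu1) + (1.0 - lam) * weibull_pdf(x, nu2)

def total_variation(theta_a, theta_b, upper=20.0, steps=20000):
    """Half the L1 distance between two mixtures, by the midpoint rule on (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoints avoid evaluating the density at x = 0
        total += abs(mixture_pdf(x, *theta_a) - mixture_pdf(x, *theta_b))
    return 0.5 * h * total

true_theta = (0.35, 1.2, 2.0)       # parameters of the mixture used in Table 11
estimate = (0.36, 1.25, 2.05)       # hypothetical estimate, for illustration only
error = total_variation(true_theta, estimate)
```

The midpoint rule is adequate here because both shapes exceed 1; for shapes below 1 the density is unbounded at zero and a finer treatment of the boundary would be needed.
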
Estimation method       No outliers                                       10% outliers
                        λ       sd(λ)   ν1      sd(ν1)  ν2      sd(ν2)    λ       sd(λ)   ν1      sd(ν1)  ν2      sd(ν2)
Hellinger
Classical MDφDE         0.344   0.059   0.497   0.079   3.063   0.476     0.376   0.053   0.339   0.030   2.892   0.484
New MDφDE RIG - 0.01    0.330   0.061   0.540   0.140   3.170   0.503     0.338   0.061   0.432   0.105   3.055   0.583
New MDφDE RIG - 0.1     0.371   0.063   0.468   0.138   3.045   0.452     0.392   0.072   0.372   0.085   2.927   0.464
New MDφDE RIG - SJ      0.395   0.072   0.442   0.134   3.013   0.443     0.424   0.086   0.354   0.082   2.916   0.459
New MDφDE MT - 5        0.311   0.062   0.520   0.065   2.875   0.451     0.316   0.063   0.376   0.036   2.699   0.471
New MDφDE MT - 10       0.302   0.062   0.548   0.077   2.903   0.433     0.306   0.062   0.384   0.039   2.727   0.448
New MDφDE MT - 15       0.295   0.063   0.564   0.084   2.927   0.434     0.298   0.063   0.388   0.042   2.745   0.450
New MDφDE MT - 20       0.289   0.063   0.575   0.091   2.943   0.437     0.291   0.063   0.392   0.044   2.758   0.454
Basu-Lindsay MT - 5     0.250   0.070   0.834   0.168   2.849   0.733     0.185   0.074   0.715   0.208   2.189   0.155
Basu-Lindsay MT - 10    0.240   0.065   0.797   0.157   2.789   0.550     0.197   0.087   0.707   0.201   2.324   0.132
Basu-Lindsay MT - 15    0.254   0.073   0.745   0.140   2.915   0.584     0.204   0.078   0.674   0.181   2.352   0.092
Beran RIG - 0.01        0.298   0.058   0.647   0.082   3.017   0.437     0.295   0.057   0.486   0.081   2.842   0.460
Beran RIG - 0.1         0.234   0.054   0.652   0.105   2.374   0.245     0.216   0.053   0.408   0.056   2.149   0.291
Beran RIG - SJ          0.194   0.056   0.653   0.134   1.936   0.246     0.142   0.065   0.402   0.144   1.601   0.325
Beran MT - 5            0.250   0.070   0.463   0.058   1.603   0.140     0.245   0.083   0.340   0.062   1.494   0.208
Beran MT - 10           0.278   0.066   0.501   0.069   2.005   0.181     0.275   0.079   0.354   0.033   1.868   0.260
Beran MT - 15           0.286   0.065   0.524   0.075   2.224   0.218     0.284   0.071   0.365   0.033   2.068   0.280
DφDE                    0.343   0.059   0.5004  0.084   3.047   0.474     0.372   0.056   0.357   0.056   2.897   0.502
MPD 0.75                0.444   0.126   0.595   0.080   3.466   0.643     0.417   0.127   0.602   0.087   3.233   0.606
MPD 0.5                 0.376   0.067   0.551   0.093   3.159   0.488     0.357   0.067   0.555   0.097   2.980   0.484
MPD 0.25                0.347   0.061   0.512   0.096   3.057   0.472     0.331   0.062   0.471   0.068   2.879   0.491
MPD 0.1                 0.344   0.059   0.496   0.084   3.050   0.470     0.343   0.058   0.384   0.037   2.859   0.484
MLE (EM)                0.344   0.059   0.498   0.079   3.063   0.476     0.376   0.053   0.339   0.303   2.892   0.482

Table 12: The mean value and the standard deviation of the estimates in a 100-run experiment on a two-component
Weibull mixture (λ = 0.35, ν1 = 0.5, ν2 = 3). The escort parameter of the DφDE is taken as the new MDφDE with
the Silverman bandwidth choice.


Estimation method       No outliers               10% outliers
                        mean    median  sd        mean    median  sd
Hellinger
Classical MDφDE         0.060   0.055   0.024     0.096   0.094   0.025
New MDφDE RIG - 0.01    0.074   0.070   0.034     0.076   0.073   0.039
New MDφDE RIG - 0.1     0.079   0.064   0.053     0.099   0.086   0.062
New MDφDE RIG - SJ      0.091   0.075   0.068     0.120   0.099   0.078
New MDφDE MT - 5        0.062   0.061   0.027     0.081   0.073   0.031
New MDφDE MT - 10       0.066   0.064   0.028     0.076   0.070   0.030
New MDφDE MT - 15       0.069   0.068   0.028     0.076   0.071   0.030
New MDφDE MT - 20       0.072   0.073   0.029     0.076   0.071   0.030
Basu-Lindsay MT - 5     0.119   0.114   0.039     0.131   0.121   0.029
Basu-Lindsay MT - 10    0.109   0.106   0.033     0.119   0.100   0.038
Basu-Lindsay MT - 15    0.107   0.103   0.030     0.112   0.097   0.033
Beran RIG - 0.01        0.077   0.080   0.026     0.066   0.063   0.029
Beran RIG - 0.1         0.105   0.104   0.025     0.112   0.108   0.038
Beran RIG - SJ          0.157   0.032   0.032     0.193   0.180   0.053
Beran MT - 5            0.182   0.183   0.025     0.207   0.202   0.032
Beran MT - 10           0.127   0.127   0.028     0.153   0.146   0.037
Beran MT - 15           0.102   0.104   0.029     0.126   0.121   0.036
DφDE                    0.060   0.057   0.024     0.091   0.088   0.027
MPD 0.75                0.103   0.083   0.067     0.097   0.083   0.065
MPD 0.5                 0.068   0.067   0.029     0.069   0.067   0.028
MPD 0.25                0.062   0.058   0.026     0.064   0.062   0.029
MPD 0.1                 0.061   0.059   0.024     0.076   0.072   0.027
MLE                     0.060   0.056   0.024     0.096   0.094   0.024

Table 13: The mean value with the standard deviation of the TVA committed in a 100-run experiment on a two-component
Weibull mixture (λ = 0.35, ν1 = 0.5, ν2 = 3). The escort parameter of the DφDE is taken as the new
MDφDE with the SJ bandwidth choice.
Estimation method       No outliers                                       10% outliers
                        λ       sd(λ)   ν1      sd(ν1)  ν2      sd(ν2)    λ       sd(λ)   ν1      sd(ν1)  ν2      sd(ν2)
Hellinger
Classical MDφDE         0.367   0.102   0.550   0.104   1.054   0.194     0.352   0.158   0.273   0.050   1.051   0.407
New MDφDE - 0.01        0.445   0.103   0.562   0.135   1.212   0.284     0.409   0.133   0.464   0.156   1.148   0.293
New MDφDE - 0.1         0.432   0.101   0.502   0.141   1.139   0.241     0.460   0.210   0.378   0.125   1.114   0.302
New MDφDE - SJ          0.431   0.101   0.485   0.141   1.127   0.244     0.487   0.216   0.356   0.108   1.110   0.309
New MDφDE MT - 5        0.350   0.158   0.619   0.134   1.006   0.211     0.436   0.313   0.375   0.121   1.245   1.177
New MDφDE MT - 10       0.338   0.148   0.643   0.135   1.019   0.167     0.474   0.322   0.409   0.140   1.150   0.516
New MDφDE MT - 15       0.335   0.148   0.658   0.135   1.029   0.161     0.456   0.321   0.411   0.146   1.292   1.689
Basu-Lindsay MT - 5     0.392   0.178   0.734   0.122   1.042   0.022     0.351   0.225   0.757   0.177   1.048   0.026
Basu-Lindsay MT - 10    0.340   0.149   0.742   0.103   1.037   0.024     0.260   0.175   0.712   0.147   1.039   0.024
Basu-Lindsay MT - 15    0.340   0.149   0.742   0.103   1.037   0.024     0.222   0.126   0.696   0.125   1.043   0.016
Beran - 0.01            0.370   0.098   0.685   0.091   1.125   0.188     0.381   0.211   0.572   0.183   1.058   0.215
Beran - 0.1             0.234   0.093   0.747   0.113   1.028   0.118     0.419   0.372   0.479   0.211   1.181   0.553
Beran RIG - SJ          0.211   0.185   0.745   0.130   1.034   0.230     0.259   0.331   0.367   0.181   1.105   0.542
Beran MT - 5            0.302   0.205   0.584   0.129   0.867   0.120     0.471   0.388   0.376   0.128   1.097   0.738
Beran MT - 10           0.327   0.175   0.610   0.132   0.929   0.121     0.490   0.347   0.394   0.131   1.155   0.803
Beran MT - 15           0.331   0.165   0.623   0.128   0.962   0.128     0.470   0.340   0.400   0.132   1.174   0.893
DφDE                    0.371   0.111   0.544   0.100   1.064   0.240     0.473   0.293   0.382   0.175   1.431   1.818
MPD 0.75                0.494   0.181   0.619   0.089   1.341   0.689     0.505   0.243   0.625   0.087   1.313   0.641
MPD 0.5                 0.413   0.134   0.577   0.101   1.143   0.349     0.412   0.255   0.582   0.101   1.059   0.358
MPD 0.25                0.366   0.108   0.542   0.110   1.064   0.349     0.554   0.348   0.503   0.117   1.205   0.995
MPD 0.1                 0.368   0.109   0.539   0.106   1.059   0.237     0.451   0.322   0.370   0.111   1.280   1.407
MLE (EM)                0.372   0.108   0.549   0.100   1.055   0.192     0.417   0.194   0.291   0.073   1.114   0.468

Table 14: The mean value and the standard deviation of the estimates in a 100-run experiment on a two-component
Weibull mixture (λ = 0.35, ν1 = 0.5, ν2 = 1). The escort parameter of the DφDE is taken as the new MDφDE with
the Silverman bandwidth choice.


Estimation method       No outliers               10% outliers
                        mean    median  sd        mean    median  sd
Hellinger
Classical MDφDE         0.056   0.055   0.026     0.124   0.114   0.035
New MDφDE RIG - 0.01    0.079   0.073   0.039     0.090   0.082   0.044
New MDφDE RIG - 0.1     0.079   0.065   0.059     0.112   0.101   0.050
New MDφDE RIG - SJ      0.076   0.065   0.041     0.129   0.117   0.065
New MDφDE MT - 5        0.063   0.058   0.029     0.114   0.095   0.041
New MDφDE MT - 10       0.067   0.063   0.028     0.112   0.102   0.038
New MDφDE MT - 15       0.069   0.067   0.028     0.111   0.105   0.036
Basu-Lindsay MT - 5     0.095   0.067   0.078     0.118   0.087   0.088
Basu-Lindsay MT - 10    0.094   0.074   0.073     0.112   0.088   0.080
Basu-Lindsay MT - 15    0.093   0.072   0.067     0.103   0.088   0.063
Beran RIG 0.01          0.079   0.081   0.028     0.089   0.087   0.033
Beran RIG 0.1           0.087   0.085   0.023     0.103   0.102   0.025
Beran RIG - SJ          0.094   0.092   0.023     0.100   0.097   0.021
Beran MT - 5            0.061   0.060   0.022     0.127   0.134   0.044
Beran MT - 10           0.059   0.055   0.025     0.115   0.096   0.041
Beran MT - 15           0.060   0.056   0.025     0.112   0.097   0.039
DφDE                    0.057   0.055   0.028     0.117   0.113   0.034
MPD 0.75                0.102   0.091   0.050     0.093   0.088   0.039
MPD 0.5                 0.072   0.067   0.032     0.075   0.074   0.033
MPD 0.25                0.061   0.056   0.028     0.092   0.090   0.039
MPD 0.1                 0.058   0.055   0.027     0.108   0.087   0.039
MLE                     0.056   0.055   0.026     0.122   0.117   0.029

Table 15: The mean value with the standard deviation of errors committed in a 100-run experiment on a two-component
Weibull mixture (λ = 0.35, ν1 = 0.5, ν2 = 1). The escort parameter of the DφDE is taken as the new
MDφDE with the SJ bandwidth choice.







7.5 Concluding remarks and comments 

Although the simulation results do not cover a wide range of models and divergences, they give some indication of
the robustness and the efficiency of the compared estimators. They also suggest possible solutions for many difficult
estimation problems through non-classical kernel methods. We summarize the most important remarks drawn
from the simulations presented in this last section.

• Both the MLE and the classical MDφDE have the best efficiency under the model, even in difficult models with heavy
tails where kernel-based approaches could not give a satisfactory result. In regular situations such as the Gaussian
model (mixtures included), all methods were equivalent under the model.

• When contamination is present, the compared estimators behaved as expected. Neither the MLE nor the classical
MDφDE is robust against contamination. The DφDE guided by our kernel-based MDφDE gave very
good results under the model. When contamination is present, however, it failed to improve on the result obtained
by the escort in difficult situations, namely the three Weibull mixtures and the GPD. It sometimes even gave
very bad results in comparison with the other estimation methods, although still better than the MLE and the
classical MDφDE.

• The Basu-Lindsay approach worked very well in regular situations and even showed improved efficiency in comparison
with Beran’s method, which is consistent with the result of Basu and Sarkar [1994]. It gave surprisingly
good results in the GPD model under contamination when we used the varying KDE, in comparison with the
situation under the model. Unfortunately, it did not give satisfactory results in any of the Weibull mixtures.
This method seems very sensitive to the kernel in difficult situations, since the model itself is smoothed
by the kernel, which creates a loss of information.

• The minimum density power divergence (MPD) estimator gave very good results in all situations except the GPD
model. The best tradeoff parameter among our set of candidates was a = 0.5.

• Beran’s method gave a very good tradeoff (often the best) between robustness and efficiency in
most situations, but performed less well in the GPD model. The best choice of kernel for the GPD and Weibull
mixtures was the RIG with window 0.01. The method was sensitive to the choice of the kernel and its window in many
situations.

• Our kernel-based MDφDE gave very good results in all situations, close to those of the MPD and
Beran’s methods. It gave the best results in the GPD model, with a very good compromise between efficiency
and robustness. It is worth noting that our new estimator was less influenced by the choice of the kernel and
the window than all the other kernel-based methods in the comparison, which is a very promising and
encouraging property.

• The use of the varying KDE (MT) gave the best results under the model. We believe that better
results could be obtained with a better method for choosing the window than the cross-validation procedure presented in
Mnatsakanov and Sarkisian [2012]. Recall that the cross-validation method gave good results under the model
but very bad ones under contamination: it chose the value α = 1 for most of the samples.


• We are surprised that the window giving the best performance for asymmetric kernels and
the varying KDE was the most extreme one (the smallest for asymmetric kernels and the largest for MT). Such
a choice corresponds to a fluctuating nonparametric density estimator. Apparently, the bias at the boundary
played the most important part in the estimation. Note that for the RIG kernel, as the window becomes smaller,
the estimator goes faster towards infinity at zero.
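
To make the role of the tradeoff parameter a in the minimum density power divergence estimator concrete, the following minimal sketch minimizes the standard empirical criterion, the integral of f^(1+a) minus (1 + 1/a) times the sample mean of f^a, over a grid of shape values. It is an illustration under stated assumptions and not the implementation used in our simulations: the model is a single Weibull density with unit scale rather than a mixture, the optimizer is a plain grid search, and the function names, grid, seed and sample size are hypothetical.

```python
import math
import random

def weibull_pdf(x, shape):
    """Weibull density with unit scale (the unit scale is an assumption of this sketch)."""
    if x <= 0.0:
        return 0.0
    return shape * x ** (shape - 1.0) * math.exp(-(x ** shape))

def dpd_objective(shape, sample, a, upper=10.0, steps=2000):
    """Empirical density power divergence criterion for a tradeoff parameter a > 0."""
    h = upper / steps
    # Midpoint rule for the integral of f^(1+a) over (0, upper).
    integral = h * sum(weibull_pdf((i + 0.5) * h, shape) ** (1.0 + a)
                       for i in range(steps))
    # Sample mean of f^a at the observations.
    empirical = sum(weibull_pdf(x, shape) ** a for x in sample) / len(sample)
    return integral - (1.0 + 1.0 / a) * empirical

def mpd_estimate(sample, a=0.5):
    """Grid search over shapes 1.00, 1.05, ..., 3.00 (a hypothetical grid)."""
    grid = [1.0 + 0.05 * k for k in range(41)]
    return min(grid, key=lambda s: dpd_objective(s, sample, a))

random.seed(0)
# random.weibullvariate(alpha, beta) takes the scale first and the shape second.
clean = [random.weibullvariate(1.0, 2.0) for _ in range(500)]
estimate = mpd_estimate(clean, a=0.5)
```

Small values of a bring the criterion close to the likelihood, whereas larger values downweight observations lying in low-density regions, which is the efficiency-robustness tradeoff referred to above.
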
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